Exploratory Data Analysis and Network Analysis¶
The following notebook continues the work of a thesis centered on the quantitative exploration of the literary genre of Gothic fiction with the help of tools established within the distant reading community: literary scholars and digital humanists who approach, analyze, and interpret texts from afar by computational means.
This section deals with exploratory data analysis on the topic distribution of the LDA model created in the previous section, which was joined with all the textual features of the original corpus. First, with the help of pyLDAvis, the most salient and relevant terms of each topic are used to provide distinct labels. Further, the underlying features of the genre are analyzed by investigating how its topics are distributed and how they shift over time. Clustering is employed on the distribution of the topics to gain insight into their makeup.
In the last section, the distribution of topics is used as a measure of similarity to determine potential routes of influence throughout the corpus with the use of network analysis.
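As a rough illustration of the idea behind the network section, the following sketch links an earlier document to a later one whenever their topic distributions are similar enough. It is a minimal sketch under stated assumptions, not the procedure used later in the thesis: the function name `influence_edges`, the toy data, and the threshold of 0.8 are all hypothetical.

```python
import numpy as np

def influence_edges(doc_topic, years, threshold=0.8):
    """Return (earlier, later, similarity) triples for sufficiently similar documents.

    Hypothetical helper: `threshold` is an arbitrary illustrative cutoff.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    # Cosine similarity: dot products of L2-normalised rows
    norms = np.linalg.norm(doc_topic, axis=1, keepdims=True)
    unit = doc_topic / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    edges = []
    n = len(years)
    for i in range(n):
        for j in range(n):
            # Only link an earlier document to a strictly later one
            if years[i] < years[j] and sim[i, j] >= threshold:
                edges.append((i, j, float(sim[i, j])))
    return edges

# Toy data (hypothetical): three documents described by three topics
toy_doc_topic = [[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.1, 0.8]]
toy_years = [1794, 1820, 1897]
edges = influence_edges(toy_doc_topic, toy_years)
print(edges)
```

The resulting triples could be passed to `add_weighted_edges_from` on a `networkx.DiGraph` to obtain a directed, weighted graph. Similarity alone does not establish influence; the direction is only imposed by the publication dates.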
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.lines as mlines
from collections import Counter
from joblib import load, dump
from ipywidgets import widgets
import plotly.graph_objects as go
import plotly.express as px
from dash import html, dcc
from dash.dependencies import Input, Output
import matplotlib.gridspec as gridspec
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import entropy
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from wordcloud import WordCloud
import community as community_louvain
from adjustText import adjust_text
import seaborn as sns
import pandas as pd
import numpy as np
import random
import networkx as nx
import pyLDAvis
import dash
import string
import time
import os
import re
The dataframes imported from the previous notebook consist of a document-topic distribution, with each document being one 5000-word segment of a book, and features describing the texts. The attributes 'title', 'author', 'date', 'gender', 'birthdate', 'nationality', and 'source' are always given, while 'period', 'mode', 'genre', 'role', and 'polarity' are only filled for about a quarter of the texts.
Additionally a number of features relevant for the topic explorations offered by pyLDAvis are imported as well.
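A quick way to check which attributes are only partially filled is to compute the share of non-missing values per column. The sketch below uses a small hypothetical frame (`toy`) in place of the real feature table; the column names are borrowed from the description above, the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the feature table
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "author": ["X", "Y", "X", "Z"],
    "period": ["Romantic", None, None, None],
    "role": [None, "Central", None, None],
})

# Share of non-missing values per column, most complete first
fill_rate = toy.notna().mean().sort_values(ascending=False)
# Columns that are not filled for every row
sparse_columns = fill_rate[fill_rate < 1.0].index.tolist()
print(fill_rate)
print(sparse_columns)
```

Applied to the real dataframe, the same two lines would show how sparse 'period', 'mode', 'genre', 'role', and 'polarity' actually are.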
df_txt_features_LDA=pd.read_csv('./analysis/df_txt_features_LDA.csv')
df_txt_features_CTM=pd.read_csv('./analysis/df_txt_features_CTM.csv')
df_txt_features_ETM=pd.read_csv('./analysis/df_txt_features_ETM.csv')
top_words_per_topic_LDA = load('./analysis/top_words_per_topic_LDA.joblib')
top_words_per_topic_CTM = load('./analysis/top_words_per_topic_CTM.joblib')
top_words_per_topic_ETM = load('./analysis/top_words_per_topic_ETM.joblib')
topic_term_dists_LDA = load('./analysis/topic_term_dists_LDA.joblib')
doc_topic_dists_LDA = load('./analysis/doc_topic_dists_LDA.joblib')
topic_term_dists_CTM = load('./analysis/topic_term_dists_CTM.joblib')
doc_topic_dists_CTM = load('./analysis/doc_topic_dists_CTM.joblib')
topic_term_dists_ETM = load('./analysis/topic_term_dists_ETM.joblib')
doc_topic_dists_ETM = load('./analysis/doc_topic_dists_ETM.joblib')
vocab = load('./analysis/vocab.joblib')
doc_lengths= load('./analysis/doc_lengths.joblib')
term_frequency = load('./analysis/term_frequency.joblib')
Exploring the feature distribution of the corpus in general¶
Plotting the general distribution of values for all relevant categories.
df_feat = df_txt_features_LDA.copy()
df_feat.fillna({'period': 'Unknown', 'mode': 'Unknown', 'genre': 'Unknown', 'role': 'Unknown'}, inplace=True)
nr_texts=df_feat.text_key.nunique()
nr_segments=df_feat.reference.nunique()
nr_authors=df_feat.author.nunique()
print(f"The corpus contains {nr_texts} unique texts, {nr_segments} unique segments, and {nr_authors} unique authors.")
# New calculations
total_entries = len(df_feat)
source_counts = df_feat['source'].value_counts()
source_percentages = (source_counts / total_entries) * 100
# Constructing the new sentence
source_sentence = "The corpus sources include: " + ", ".join(
    f"{source} with {count} entries ({percentage:.2f}%)"
    for source, count, percentage in zip(source_counts.index, source_counts.values, source_percentages)
)
print(source_sentence)
The corpus contains 181 unique texts, 221 unique segments, and 89 unique authors. The corpus sources include: colors with 110 entries (49.77%), pb-manual with 47 entries (21.27%), pb-under with 47 entries (21.27%), gutenberg with 17 entries (7.69%)
# Categorical features
categorical_features = ['title', 'author', 'gender', 'nationality', 'source', 'period', 'mode', 'genre', 'role']
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    # Ordering the categories by frequency
    order = df_feat[feature].value_counts().index
    sns.countplot(y=feature, data=df_feat, order=order)
    plt.title(f'Distribution of {feature}')
    plt.show()
# Numerical features
numerical_features = ['date', 'birthdate']
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(df_feat[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
Roughly 50% of the documents in the corpus are taken from the color corpus, another 21% are drawn from the lists of Underwood, a further 21% from the author lists of Punter and Botting, and just under 8% come from the shelf of Project Gutenberg and are not covered by any other source. Around half of the texts are by British authors, with another 20% by Scottish, Irish, or Welsh authors, and American texts make up not quite 30% of the distribution. Other English-speaking sources rarely occur.
Two-thirds of the documents have a male author. The general distribution of publishing dates reflects waves of literary production, in line with Moretti's assertion of two peaks in the production of the genre (Graphs, Maps, Trees, p. 15), at 1800 and 1830, but adds a third peak around 1900. A slow fade-out in the early 20th century was chosen to prevent further blurring and muddying of genre boundaries around the advent of weird fiction at the beginning of the 20th century.
Information on the period, text type, and role within the larger canon is only provided by the color corpus. Two-thirds of the labeled texts are labeled Romantic and roughly one-third Victorian, which reflects the distribution of publishing dates: the former covers the two peaks of the late 18th and early 19th century, while the latter accounts for the peak around 1900.
Around half of the labeled texts are novels, with another quarter being short stories and novellas, while poetry, drama, and other forms are underrepresented. It is to be expected that the share of short stories is underreported, given the inclusion of a large number of short story collections and the propensity of some of the major contributors to the corpus, like Poe, Machen, and Blackwood, to write exclusively in short fiction formats.
# Most prevalent titles and authors
top_authors = df_feat['author'].value_counts().nlargest(20).index
top_titles = df_feat['title'].value_counts().nlargest(20).index
# Plotting distributions for 'author'
plt.figure(figsize=(10, 6))
author_order = df_feat['author'].value_counts().iloc[:20].index
sns.countplot(y='author', data=df_feat, order=author_order)
plt.title('Top 20 Authors Distribution')
plt.show()
# Plotting distributions for 'title'
plt.figure(figsize=(10, 6))
title_order = df_feat['title'].value_counts().iloc[:20].index
sns.countplot(y='title', data=df_feat, order=title_order)
plt.title('Top 20 Titles Distribution')
plt.show()
LDA¶
pyLDAvis offers an intuitive method for exploring the most important words for each topic, the weight they carry within it, and the relationship and distance between the topics. To this end, multidimensional scaling reduces the topic-term distribution to a two-dimensional space, retaining both the importance of each topic within the corpus and the distances between the topics, with Jensen-Shannon divergence as the distance metric, a common choice for the multidimensional scaling of probability distributions.
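To make the projection step more tangible, the following sketch computes pairwise Jensen-Shannon distances between topics and reduces them to two dimensions with classical multidimensional scaling. It is a minimal illustration on a hypothetical toy matrix, not a reproduction of pyLDAvis internals; the function name `topic_coordinates` is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, jensenshannon

def topic_coordinates(topic_term):
    """Hypothetical helper: 2D coordinates for topics from their term distributions."""
    topic_term = np.asarray(topic_term, dtype=float)
    # Pairwise Jensen-Shannon distances (SciPy returns the square root of the divergence)
    dist = squareform(pdist(topic_term, metric=jensenshannon))
    # Classical MDS: double-centre the squared distances, then eigendecompose
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:2]  # indices of the two largest eigenvalues
    coords = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))
    return dist, coords

# Toy topic-term matrix (hypothetical): topics 0 and 1 are similar, topic 2 is not
toy_topic_term = [[0.50, 0.30, 0.10, 0.10],
                  [0.45, 0.35, 0.10, 0.10],
                  [0.05, 0.05, 0.40, 0.50]]
dist, coords = topic_coordinates(toy_topic_term)
print(np.round(dist, 3))
print(np.round(coords, 3))
```

Topics with similar term distributions end up close together in the plane, which is what the intertopic distance map in the visualization conveys.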
The following interactive visualization is only properly displayed in the html version or when run locally.
prepared_data = pyLDAvis.prepare(topic_term_dists_LDA, doc_topic_dists_LDA, doc_lengths, vocab, term_frequency)
pyLDAvis.display(prepared_data)
Topic Interpretation: Analyzing the intersection of the most salient and relevant terms for each topic, aiming to synthesize the underlying themes into coherent labels. Ennui, ants, firmness, confessor, vegetables, illusion, calculation, morbid, blasted, coolies, beggars, bureau, bayonets, and terms of logical reasoning are strewn throughout many topics, providing background noise.
topic_labels = {
"Topic 1": "Ominous Atmosphere - \n Spatial and Auditory Imagery: \n vastness, archaic, Refinement, Gloom, demons.",
"Topic 2": "Emotional Dialogue - \n Fear, Secrecy, Flattery, Arousal and Strife \n - Religion and Devils.",
"Topic 3": "Status and Individuality - \n Striving, Misery and Plentifulness - Excess.",
"Topic 4": "Myths, Trials and Death - \n Persecution of Crime, Telling Tales, magic and ants.",
"Topic 5": "Excitability, Madness and Deceit - \n Aggression, conflict and glee.",
"Topic 6": "Nature and Reasoning - \n Creativity, Understanding, mixed with Fauna.",
"Topic 7": "Social Pleasantries - \n Diplomacy, Plotting to Gossip.",
"Topic 8": "Faith, Convictions, Chivalry and Death - \n Erudition, Religion and Knights. Ants.",
"Topic 9": "Fortitude, Conviction and Adventure - \n Danger and Social Station.",
"Topic 10": "Ferocity and Tragedy - \n animalistic traits, intimacy, conflict, and science.",
"Topic 11": "Ravens and Gloom - Longing, Death and Artifice.",
"Topic 12": "Home Invasion - Domestic Mystery and Conflict.",
"Topic 13": "Rituals and Festivities - \n Dance, Witchcraft and Coronations.",
"Topic 14": "Conflict, Animosity and Change - \n Emotional Changes, Death and Construction.",
"Topic 15": "Trickery and Science - \n Deceit, Reasoning and Institutions.",
"Topic 16": "Desecrated Chapel - \n Confessions and Defilement - Devils and Maniacs.",
"Topic 17": "(Un-)death, spectral bodies and judgment - \n human physicality, grief, emotions.",
"Topic 18": "Mystery and Adversity - \n Dream and fugue states, Investigation.",
"Topic 19": "Forlorn Carnival - Dances, Disgust and Intimacy.",
"Topic 20": "Science, Reasoning and Objects - \n Technology, Professions and Nature.",
"Topic 21": "War, Punishment, and Exploration.",
"Topic 22": "Emotional Dynamics and Interactions.",
"Topic 23": "War, dreams and demons.",
"Topic 24": "Human Interactions and Emotional States.",
"Topic 25": "Flattery, clothing, Interactions.",
"Topic 26": "Witchcraft, Rituals, and Fear of it - \n Banishment, Threats, and Armor.",
"Topic 27": "Dragon Attack and Defense - \n Troops, Mountains and Cynicism.",
"Topic 28": "Communion in Nature - \n Transformation, Relationships and Identity.",
"Topic 29": "Bickering, Fighting, and Mountains.",
"Topic 30": "Bureaucracy, Bargaining and Dissatisfaction.",
"Topic 31": "Exploration, Gloom, Caverns.",
"Topic 32": "Tranquility and Bustle - \n Terms of Relaxation, Calm and Action.",
"Topic 33": "Treacherous Company - on the run and scarred.",
"Topic 34": "Secrets and Suspense - \n Mystery, Devils and Assassinations.",
"Topic 35": "Mental Illness, Law and Outcasts - \n Fear, Suspicion and Struggles.",
"Topic 36": "Individualism vs. Conformity - \n Rebellion and Social Norms.",
"Topic 37": "Order and Chaos - \n Constrained Focus and Unchecked Emotions.",
"Topic 38": "Psychology, Trauma, and Secrets.",
"Topic 39": "Quest for Meaning - Self-Discovery, Transformation.",
"Topic 40": "Ambition and Struggle - Emotional Turmoil.",
"Topic 41": "Despair, Isolation and Oppression.",
"Topic 42": "Illusion, Enchantment and Betrayal.",
"Topic 43": "Woodlands, Mystery, Illusion, Beasts.",
"Topic 44": "Companionship in Times of Trial and Distress.",
"Topic 45": "Intimacy, Emotions, and Identity.",
"Topic 46": "Frustration, Society, Retreat into Nature - \n Society, Reason, Tension, negative Feelings, Forests.",
"Topic 47": "Human Nature and the Connection to the Land, \n Myth and (Human) Nature - Solace, Inspiration, Acceptance for Hardships.",
"Topic 48": "Enthralling Garden full of Voices - \n Enchantment and Vocalization, Nature.",
"Topic 49": "Departure and Music.",
"Topic 50": "Myth, Nature, Wonder and Despair.",
"Topic 51": "Disillusionment with Society - \n Resistance, Protest, Retreat.",
"Topic 52": "Adventure, Splendor, Power and Challenges, History.",
"Topic 53": "Mercantile and Creativity - Haggling and Emotions.",
"Topic 54": "Medieval Cities, Castles and Courtship.",
"Topic 55": "Crocodiles, Massacres and Traveling.",
"Topic 56": "Exploration of an Island and Obsession.",
"Topic 57": "Carnage near a Castle.",
"Topic 58": "Weddings and Rituals - Clamoring Throng.",
"Topic 59": "Judgment and Scrutiny - Tense Diplomacy.",
"Topic 60": "Confession and marriage before \n Conscription and Battle.",
"Topic 61": "Vampires, Regality, Experiments, \n Festivities and Sacrifice.",
"Topic 62": "Dragons, Subterraneous Lairs, Riddles and Lore.",
"Topic 63": "Hidden Dangers, Fear, Anticipation, Supernatural.",
"Topic 64": "Artistic Ambition and Trials - Mastery and the Devil.",
"Topic 65": "Atmospheric Battle Descriptions and Royalty.",
"Topic 66": "Hidden Knowledge, Learning and Secrets.",
"Topic 67": "Monsters, Art, Romance - Myth and Gloom.",
"Topic 68": "Secluded Initiation Rites.",
"Topic 69": "Seduction, Deception, Violence, Bureaucracy.",
"Topic 70": "Myth and splendor - Wealth and Castles.",
"Topic 71": "Haunted Castles and their Prophecies.",
"Topic 72": "Festivities, Noise, Crowds.",
"Topic 73": "Camps, Trenches and Weather."
}
'''
Generally speaking, the topics can be categorized in a set of main groups:
-Emotional turmoil and psychological distress
-Physical violence and combat
-Social settings, diplomacy, and court
-Self-expression and frustration with society
-Myth, lore and tales
-Forbidden truths and knowledge
-Adventure and exploration
-Ambition, greed, and regality
-Deceit and apprehension
-Science and reasoning
-Nature - woods, mountains and harbors
-Religion and sacred rituals
-Monsters, demons and undead
-Medieval settings, cities, and castles
-Dreams and illusions
'''
Visualizing the qualities of topics¶
Recreating the term relevance measure used in pyLDAvis and creating wordclouds for ease of comparison
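For reference, the two measures recreated in the next cell can be written out as follows. The notation is chosen here for illustration: $p(w \mid t)$ is the probability of term $w$ within topic $t$, $p(w)$ its overall probability in the corpus, $p(t)$ the overall weight of topic $t$, and $\lambda$ the weighting parameter (called `lambda_step` in the code).

```latex
\begin{align}
  \text{relevance}(w, t \mid \lambda) &= \lambda \, \log p(w \mid t)
      + (1 - \lambda) \, \log \frac{p(w \mid t)}{p(w)} \\
  \text{saliency}(w) &= p(w) \sum_{t} p(t \mid w) \, \log \frac{p(t \mid w)}{p(t)}
\end{align}
```

With $\lambda = 1$ terms are ranked purely by their probability within the topic, while smaller values of $\lambda$ give more weight to terms that are specific to that topic.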
def calculate_term_relevance(topic_term_dists, term_frequency, lambda_step=0.6):
    """
    Calculate term relevance for each topic.
    Relevance is defined as in pyLDAvis: lambda * log(prob of term given topic) +
    (1 - lambda) * log(prob of term given topic / prob of term in corpus)
    """
    # Convert term frequency to probability
    term_prob = term_frequency / term_frequency.sum()
    # Log probability of term given topic
    log_prob_w_given_t = np.log(topic_term_dists + 1e-12)  # small constant avoids log(0)
    # Log lift
    log_lift = np.log(topic_term_dists / term_prob + 1e-12)  # small constant avoids log(0)
    # Term relevance
    term_relevance = lambda_step * log_prob_w_given_t + (1 - lambda_step) * log_lift
    return term_relevance
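def calculate_saliency(topic_term_dists, term_frequency):
    """
    Calculate the saliency of terms according to the logic of pyLDAvis.
    Saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))]
    """
    topic_term_dists = np.asarray(topic_term_dists, dtype=float)  # shape: (topics, terms)
    # Convert term frequency to probability
    term_prob = term_frequency / term_frequency.sum()
    # p(t | w): normalize each term column over the topics.
    # Assumption: topics are weighted by their row sums only, since the
    # document-topic weights are not passed to this function.
    p_t_given_w = topic_term_dists / (topic_term_dists.sum(axis=0, keepdims=True) + 1e-12)
    # p(t): one value per topic
    p_t = topic_term_dists.sum(axis=1) / topic_term_dists.sum()
    # Calculating saliency (small constant avoids log(0))
    distinctiveness = np.sum(p_t_given_w * np.log(p_t_given_w / p_t[:, None] + 1e-12), axis=0)
    saliency = term_prob * distinctiveness
    return saliency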
def generate_word_clouds(term_relevance, saliency, topic_term_dists_LDA, vocab, n_topics):
    wc_width, wc_height = 200, 200  # wc size in pixels
    # Create subplot grid
    fig, axs = plt.subplots(nrows=19, ncols=8, figsize=(36, 85))
    axs = axs.flatten()
    for i in range(n_topics):
        # Generate salient word cloud
        topic_saliency = saliency * topic_term_dists_LDA[i, :]
        top_salient_terms = topic_saliency.argsort()[-30:][::-1]
        salient_word_freq = {vocab[term]: topic_saliency[term] for term in top_salient_terms}
        salient_wc = WordCloud(width=wc_width, height=wc_height, background_color='white', colormap='Greens').generate_from_frequencies(salient_word_freq)
        axs[i*2].imshow(salient_wc, interpolation='bilinear')
        axs[i*2].axis('off')
        axs[i*2].set_title(f'Topic {i+1} - Salient', fontsize=23)
        # Generate relevant word cloud
        topic_relevance = term_relevance[i, :]
        top_relevant_terms = topic_relevance.argsort()[-30:][::-1]
        relevant_word_freq = {vocab[term]: topic_relevance[term] for term in top_relevant_terms}
        relevant_wc = WordCloud(width=wc_width, height=wc_height, background_color='white', colormap='Reds').generate_from_frequencies(relevant_word_freq)
        axs[i*2+1].imshow(relevant_wc, interpolation='bilinear')
        axs[i*2+1].axis('off')
        axs[i*2+1].set_title(f'Topic {i+1} - Relevant', fontsize=23)
    # Hide the remaining axes
    for i in range(n_topics*2, len(axs)):
        axs[i].set_visible(False)
    plt.subplots_adjust(wspace=0.5, hspace=0.5)
    plt.tight_layout()
    plt.show()
term_relevance = calculate_term_relevance(topic_term_dists_LDA, np.array(term_frequency))
saliency = calculate_saliency(topic_term_dists_LDA, np.array(term_frequency))
generate_word_clouds(term_relevance, saliency, topic_term_dists_LDA, vocab, topic_term_dists_LDA.shape[0])
To decrease the overall file size, the following visualization is provided as an image outside of the notebook itself. Please refer to topic_wordclouds.png.
Topic trends over time¶
df_time = df_txt_features_LDA.copy()
topic_columns = [col for col in df_time.columns if col.startswith('Topic')]
def year_to_decade(year):
    return (year // 10) * 10
df_time['decade'] = df_time['date'].apply(year_to_decade)
# Grouping by 'decade' and calculating the mean for topic distributions
decade_grouped = df_time.groupby('decade')[topic_columns].mean()
A general distribution of all topics brings with it too much consistent bottom-line noise, so we shall look more closely at those entries that at some point rise to enough prominence.
plt.figure(figsize=(20, 8)) # Keeping the graph broad
for topic in topic_columns:
    plt.plot(decade_grouped.index, decade_grouped[topic], label=topic)
plt.xlabel('Decade')
plt.ylabel('Topic Distribution')
plt.title('Adjusted Topic Trends Over Decades')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=10) # Spreading out the legend further with fewer rows
plt.show()
Topics that surpass a certain threshold of importance at some point in their life cycle.
- Filtering for maximal weight throughout their lifetime.
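The filtering step can be illustrated on a small hypothetical table. The sketch below bins publication years into decades, averages the topic weights per decade, and keeps only the topics whose decade mean exceeds a threshold at least once. The toy values and the variable names `toy`, `decade_means`, and `peaking` are assumptions for illustration; the threshold of 8 mirrors the one used in the following cell.

```python
import pandas as pd

# Hypothetical segments with publication year and two topic weights
toy = pd.DataFrame({
    "date": [1794, 1796, 1818, 1820, 1897],
    "Topic 1": [12.0, 10.0, 2.0, 4.0, 1.0],
    "Topic 2": [3.0, 5.0, 4.0, 4.0, 6.0],
})
topic_cols = [c for c in toy.columns if c.startswith("Topic")]

# Floor each year to its decade, then average the topic weights per decade
toy["decade"] = (toy["date"] // 10) * 10
decade_means = toy.groupby("decade")[topic_cols].mean()

# Keep topics whose decade mean rises above the threshold at least once
threshold = 8
peaking = [t for t in topic_cols if decade_means[t].max() > threshold]
print(decade_means)
print(peaking)
```

In this toy case only Topic 1 qualifies, because its mean in the 1790s is 11 while Topic 2 never rises above 6.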
# Group 1: Topics that never rise beyond a consistent level
consistent_topics = [topic for topic in topic_columns if decade_grouped[topic].max() <= 8]
# Group 2: Topics that fluctuate
peaking_topics = [topic for topic in topic_columns if decade_grouped[topic].max() > 8]
plt.figure(figsize=(20, 8))
for topic in peaking_topics:
    # Get the label for the topic
    label = f'{topic}: {topic_labels.get(topic, "Label not found")}'
    plt.plot(decade_grouped.index, decade_grouped[topic], label=label)
plt.xlabel('Decade')
plt.ylabel('Topic Distribution')
plt.title('Adjusted Topic Trends Over Decades')
# Place the legend to the right of the plot as a single vertical column
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), ncol=1)
plt.tight_layout()
plt.show()
Topics that surpass a certain threshold of fluctuation, that is, whose standard deviation across decades lies above the 90th percentile of all topics, indicating that they do not maintain consistent values and vary significantly over the decades.
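The fluctuation filter can be sketched in the same way. The helper below is a hypothetical illustration on invented decade means: it computes the standard deviation of each topic across decades and keeps the topics above a chosen percentile of those values. The percentile is left as a parameter, since it is the one setting that decides how many topics survive.

```python
import numpy as np
import pandas as pd

def fluctuating_topics(decade_means, percentile):
    """Hypothetical helper: topics whose std across decades exceeds the given percentile."""
    fluct = decade_means.std()  # sample standard deviation per topic column
    cutoff = np.percentile(fluct, percentile)
    return fluct[fluct > cutoff].index.tolist()

# Invented decade means for four topics
toy_means = pd.DataFrame(
    {"Topic 1": [1.0, 9.0, 1.0, 9.0],
     "Topic 2": [4.0, 5.0, 4.0, 5.0],
     "Topic 3": [3.0, 3.0, 3.0, 3.0],
     "Topic 4": [2.0, 4.0, 2.0, 4.0]},
    index=[1790, 1800, 1810, 1820],
)
strict = fluctuating_topics(toy_means, 90)
lenient = fluctuating_topics(toy_means, 50)
print(strict, lenient)
```

Lowering the percentile admits more topics, which is the same trade-off the slider in the interactive version further below exposes.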
# Calculating the standard deviation for each topic to measure fluctuations
topic_fluctuations = decade_grouped.std()
# Setting a threshold for identifying strong fluctuations
percentile_threshold = np.percentile(topic_fluctuations, 90)
fluctuating_topics = topic_fluctuations[topic_fluctuations > percentile_threshold].index.tolist()
plt.figure(figsize=(20, 8))
for topic in fluctuating_topics:
    # Get the label for the topic, combining topic number and label
    label = f'{topic}: {topic_labels.get(topic, "Label not found")}'
    plt.plot(decade_grouped.index, decade_grouped[topic], label=label)
plt.xlabel('Decade')
plt.ylabel('Topic Distribution')
plt.title('Adjusted Topic Trends Over Decades')
# Place the legend to the right of the plot in a single vertical column
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), ncol=1)
plt.tight_layout()
plt.show()
Both the selection of topics based on a minimum fluctuation in their importance and the grouping based on crossing a base threshold return a similar picture, emphasizing 8 different topics and their distribution throughout time. The three peaks in textual representation, around 1800, 1830, and 1900, are in part reflected in the rise of specific topics. Topics 3, 36, and 52 peak before 1800 and then fade out of importance. Topic 5 peaks early, declines to a moderate degree until 1830, and then remains as a constant undercurrent. Topic 70, on the other hand, rises to prominence early, falls out of use, rises very strongly in 1830 to become a predominant influence, rises again to a lesser degree in 1860, and remains a stable baseline throughout. Topics 51 and 65 reach a very decisive peak at 1800 and a second at 1860 and 1880 respectively. Topic 4 only shows two peaks, a smaller one at 1830 and a large spike at around 1850.
This selection offers a clear cut through all of the central motifs of the genre.
Topic 3: Status and Individuality - Striving, Misery, and Plentifulness - Excess.
Peaks in the early 1760s, mid-1780s, and early 1800s suggest that themes of personal ambition and the consequences of excess were particularly salient during these times. This could reflect societal concerns about the individual's place in a rapidly changing social order in the underlying literature.
Topic 4: Myths, Trials, and Death - Persecution of Crime, Telling Tales, magic, and ants.
Shows a consistent presence across the timeline with notable peaks in the late 1770s and mid-1850s.
Topic 5: Excitability, Madness, and Deceit - Aggression, conflict, and glee.
Exhibits spikes around the 1790s and again in the 1830s. These periods coincide with historical events like the French Revolution and the early onset of urbanization and industrialization, reflecting the tumultuous nature of the times.
Topic 36: Individualism vs. Conformity - Rebellion and Social Norms.
There's an interesting surge in the early 1790s, a sentiment that is concurrently explored by the Romantic thinkers, some of them overlapping with the authors of Gothic novels. Several topics explore their themes further.
Topic 51 & 52: Disillusionment with Society - Resistance, Protest, Retreat. Adventure, Splendor, Power and Challenges, History.
These topics seem to rise and fall in tandem at several points (e.g., 1780s and 1840s), suggesting that tales of adventure and power struggles were often accompanied by themes of societal disillusionment.
Topic 65: Atmospheric Battle Descriptions and Royalty.
Shows a peak around 1810, concurrent with the Napoleonic Wars, while the recontextualization into medieval settings of any vivid battle scenes and discussions of royalty offers a safe boundary.
Topic 70: Myth and Splendor - Wealth and Castles.
Peaks sharply in the late 1780s and has another smaller peak in the 1830s, aligning with the genre's fascination with the aristocracy and ancient sites.
The following interactive visualization is only properly displayed in the html version or when run locally.
df_LDA = df_txt_features_LDA.copy()
app = dash.Dash(__name__)
# Function to convert year to decade for grouping
def year_to_decade(year):
    return (year // 10) * 10
# Formatting 'decade' column as int to comply with Dash format requirements
df_LDA['decade'] = df_LDA['date'].astype(int).apply(year_to_decade)
topic_columns_LDA = [col for col in df_LDA.columns if col.startswith('Topic')]
# Grouping by 'decade' and calculating the mean for topic distributions
decade_grouped_LDA = df_LDA.groupby('decade')[topic_columns_LDA].mean()
# Calculating the standard deviation for each topic to measure fluctuations
topic_fluctuations = decade_grouped_LDA.std()
# Function to filter topics based on a fluctuation percentile threshold
def filter_topics_by_percentile(threshold_percentile):
    percentile_threshold = np.percentile(topic_fluctuations, threshold_percentile)
    return topic_fluctuations[topic_fluctuations > percentile_threshold].index.tolist()
# Function to update the figure based on selected topics
def create_figure(selected_topics):
    fig = go.Figure()
    for topic in selected_topics:
        hovertext = f"{topic_labels.get(topic, topic)}\n({topic})"
        fig.add_trace(go.Scatter(x=decade_grouped_LDA.index, y=decade_grouped_LDA[topic],
                                 mode='lines', name=topic, hovertext=hovertext, hoverinfo="text+x+y"))
    fig.update_layout(height=600, legend_orientation="h", legend=dict(x=0, y=1.1, xanchor='left'))
    return fig
# Create slider
slider = dcc.Slider(
    id='percentile-slider',
    min=0,
    max=100,
    value=90,
    marks={i: f'{i}%' for i in range(0, 101, 25)},
    step=1
)
# Create dropdown (initially empty)
dropdown = dcc.Dropdown(
    id='topic-dropdown',
    options=[],
    value=[],
    multi=True
)
# App layout
app.layout = html.Div([
    html.Div([slider]),
    html.Div([dropdown]),
    dcc.Graph(id='topic-graph')
])
# Callback for updating the dropdown options and selected values based on slider value
@app.callback(
    [Output('topic-dropdown', 'options'),
     Output('topic-dropdown', 'value')],
    [Input('percentile-slider', 'value')]
)
def update_dropdown_options(percentile_value):
    filtered_topics = filter_topics_by_percentile(percentile_value)
    options = [{'label': topic, 'value': topic} for topic in filtered_topics]
    return options, [option['value'] for option in options]
# Callback for updating the graph based on selected topics and percentile
@app.callback(
    Output('topic-graph', 'figure'),
    [Input('topic-dropdown', 'value'),
     Input('percentile-slider', 'value')]
)
def update_graph(selected_topics, percentile_value):
    return create_figure(selected_topics)
# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
Author-Specific Topic Analysis:¶
df = df_txt_features_LDA.copy()
topic_columns = [col for col in df.columns if col.startswith('Topic')]
top_authors = df['author'].value_counts().head(20).index.tolist()
central_authors = df[df['role'] == 'Central']['author'].unique().tolist()
refined_central_authors = list(set(central_authors + top_authors))
aggregated_topics_top_authors = pd.DataFrame(index=top_authors, columns=topic_columns)
for author in top_authors:
    aggregated_topics_top_authors.loc[author] = df[df['author'] == author][topic_columns].sum()
aggregated_topics_top_authors = aggregated_topics_top_authors.apply(pd.to_numeric)
# For Top Authors
top_5_topics_top_authors = pd.DataFrame(index=top_authors, columns=['Top1', 'Top2', 'Top3', 'Top4', 'Top5'])
for author in top_authors:
    top_topics = aggregated_topics_top_authors.loc[author].nlargest(5).index.tolist()
    top_5_topics_top_authors.loc[author] = top_topics
filtered_data_top_authors = pd.DataFrame(index=top_authors, columns=topic_columns)
for author in top_authors:
    top_topics = top_5_topics_top_authors.loc[author]
    filtered_data_top_authors.loc[author, top_topics] = aggregated_topics_top_authors.loc[author, top_topics]
filtered_data_top_authors.fillna(0, inplace=True)
filtered_data_top_authors = filtered_data_top_authors.apply(pd.to_numeric)
# Re-aggregate Topic Distribution for the refined list of central authors
aggregated_topics_refined_central = pd.DataFrame(index=refined_central_authors, columns=topic_columns)
for author in refined_central_authors:
    aggregated_topics_refined_central.loc[author] = df[df['author'] == author][topic_columns].sum()
aggregated_topics_refined_central = aggregated_topics_refined_central.apply(pd.to_numeric)
# Identifying Top 5 Topics for the central authors
top_5_topics_refined_central = pd.DataFrame(index=refined_central_authors, columns=['Top1', 'Top2', 'Top3', 'Top4', 'Top5'])
for author in refined_central_authors:
    top_topics = aggregated_topics_refined_central.loc[author].nlargest(5).index.tolist()
    top_5_topics_refined_central.loc[author] = top_topics
# Preparing data for visualization
filtered_data_refined_central = pd.DataFrame(index=refined_central_authors, columns=topic_columns)
for author in refined_central_authors:
    top_topics = top_5_topics_refined_central.loc[author]
    filtered_data_refined_central.loc[author, top_topics] = aggregated_topics_refined_central.loc[author, top_topics]
filtered_data_refined_central.fillna(0, inplace=True)
filtered_data_refined_central = filtered_data_refined_central.apply(pd.to_numeric)
# Creating stacked bar charts with labels for the top 5 topics for Top Authors
# (DataFrame.plot creates its own figure, so no separate plt.figure call is needed)
ax_top = filtered_data_top_authors.plot(kind='bar', stacked=True, figsize=(20, 10), legend=False)
# Adding labels within each bar for Top Authors
for i, author in enumerate(top_authors):
    cum_value = 0
    for topic in top_5_topics_top_authors.loc[author]:
        value = filtered_data_top_authors.at[author, topic]
        if value > 0:
            # Positioning the label in the center of the segment
            ax_top.text(i, cum_value + value/2, topic, ha='center', va='center')
            cum_value += value
plt.title('Top 5 Aggregated Topic Distributions for Top Authors')
plt.xlabel('Author')
plt.ylabel('Aggregated Topic Proportions')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
The aggregation by summed topic weight for the top authors within the corpus shows a pronounced focus on Topic 12 (Home Invasion) for Hawthorne, on Topic 69 (Seduction, Deception, Violence, Bureaucracy) for Ambrose, and on Topic 28 (Communion in Nature - Transformation, Relationships and Identity) for Kipling.
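Summing topic weights per author favors authors with many segments and lets a single extreme segment dominate, which is one reason to compare the result against a median-based aggregation. The sketch below contrasts the two on invented data, using `groupby` in place of the explicit loops; the helper name `top_topics_per_author` and the toy values are hypothetical.

```python
import pandas as pd

def top_topics_per_author(frame, topic_cols, k=2, how="sum"):
    """Hypothetical helper: the k strongest topics per author under a given aggregation."""
    agg = frame.groupby("author")[topic_cols].agg(how)
    return {author: row.nlargest(k).index.tolist() for author, row in agg.iterrows()}

# Invented segments: author X has one outlier segment dominated by Topic 1
toy = pd.DataFrame({
    "author": ["X", "X", "X", "Y", "Y"],
    "Topic 1": [1.0, 1.0, 30.0, 5.0, 6.0],
    "Topic 2": [8.0, 9.0, 2.0, 1.0, 2.0],
    "Topic 3": [3.0, 3.0, 3.0, 9.0, 9.0],
})
cols = ["Topic 1", "Topic 2", "Topic 3"]
by_sum = top_topics_per_author(toy, cols, how="sum")
by_median = top_topics_per_author(toy, cols, how="median")
print(by_sum)
print(by_median)
```

For author X the summed weights rank Topic 1 first because of one segment, whereas the median ranks Topic 2 first, which better describes the typical segment.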
refined_central_authors = list(set(central_authors + top_authors))
# Re-aggregate Topic Distribution for the refined list of central authors using median
aggregated_topics_refined_central = pd.DataFrame(index=refined_central_authors, columns=topic_columns)
for author in refined_central_authors:
    aggregated_topics_refined_central.loc[author] = df[df['author'] == author][topic_columns].median()
aggregated_topics_refined_central = aggregated_topics_refined_central.apply(pd.to_numeric)
# Identifying Top 5 Topics for the central authors using the updated aggregation
top_5_topics_refined_central = pd.DataFrame(index=refined_central_authors, columns=['Top1', 'Top2', 'Top3', 'Top4', 'Top5'])
for author in refined_central_authors:
    top_topics = aggregated_topics_refined_central.loc[author].nlargest(5).index.tolist()
    top_5_topics_refined_central.loc[author] = top_topics
# Preparing data for visualization for refined central authors
filtered_data_refined_central = pd.DataFrame(index=refined_central_authors, columns=topic_columns)
for author in refined_central_authors:
    top_topics = top_5_topics_refined_central.loc[author]
    filtered_data_refined_central.loc[author, top_topics] = aggregated_topics_refined_central.loc[author, top_topics]
filtered_data_refined_central.fillna(0, inplace=True)
filtered_data_refined_central = filtered_data_refined_central.apply(pd.to_numeric)
# Use a large figsize to provide enough space
# (DataFrame.plot creates its own figure, so no separate plt.figure call is needed)
ax_top = filtered_data_top_authors.plot(kind='bar', stacked=True, figsize=(20, 10), legend=False)
# Decrease fontsize to ensure they fit within the segments
topic_label_fontsize = 12
# Add labels within each bar for Top Authors
for i, (idx, row) in enumerate(filtered_data_top_authors.iterrows()):
    cum_value = 0
    for topic in top_5_topics_refined_central.loc[idx]:
        value = row[topic]
        if value > 0:
            # Positioning the label in the center of the segment
            ax_top.text(i, cum_value + value/2, topic, ha='center', va='center', fontsize=topic_label_fontsize)
            cum_value += value
ax_top.set_title('Top 5 Aggregated Topic Distributions for Top Authors', fontsize=16)
ax_top.set_xlabel('Author', fontsize=12)
ax_top.set_ylabel('Aggregated Topic Proportions', fontsize=16)
# Rotate and set fontsize for x-axis tick labels
ax_top.set_xticklabels(ax_top.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.tight_layout()
plt.show()
<Figure size 2000x600 with 0 Axes>
Aggregating by the median for each author showed Charlotte Smith's heavy reliance on Topic 37: Order and Chaos, Kipling's use of Topic 45: Enthralling Garden Full of Voices, and the prominence of Topic 38: Psychology, Trauma, and Secrets in both Lytton and Brown.
Stoker’s and Radcliffe’s bars, for example, show a high proportion of Topic 65: "Atmospheric Battle Descriptions and Royalty", which aligns with their narratives often involving conflict and nobility. The presence of Topic 12: "Home Invasion - Domestic Mystery and Conflict" is significant in the bars for several authors, including Stoker and Blackwood, which could indicate a shared interest in the intrusion of terror into personal and domestic spheres.
Thematic Shifts and Trends:
Authors with a higher proportion of themes related to societal issues, such as Hawthorne and Corelli, may reflect a more critical view of the status quo, while those with higher proportions of personal and psychological themes, like Poe and Radcliffe, might be more focused on individual experience and interiority.
Historical and Cultural Context:
Some authors show a strong leaning towards topics that may relate to historical events or cultural trends of their time. For instance, Topic 65: "Atmospheric Battle Descriptions and Royalty" in the works of Stoker and Radcliffe could suggest an influence of the political climate of their times, such as the lingering effects of the Napoleonic Wars or the upheaval of the Victorian era.
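The sum-based and median-based aggregations discussed above can rank an author's topics differently, because a single text dominated by one topic inflates the sum but barely moves the median. The following is a minimal sketch on invented toy values; the frame `toy` and the author name "A" are illustrative assumptions and are not taken from the corpus.

```python
import pandas as pd

# Toy data (hypothetical values): three texts by one author, two topics.
# One text is almost entirely "Topic 34"; the other two barely use it.
toy = pd.DataFrame(
    {
        "author": ["A", "A", "A"],
        "Topic 34": [0.98, 0.00, 0.01],
        "Topic 70": [0.01, 0.30, 0.32],
    }
)
topic_cols = ["Topic 34", "Topic 70"]

# Aggregate the same data in two ways.
by_sum = toy.groupby("author")[topic_cols].sum()
by_median = toy.groupby("author")[topic_cols].median()

# The strongest topic depends on the aggregation that is used.
top_by_sum = by_sum.loc["A"].idxmax()
top_by_median = by_median.loc["A"].idxmax()
print(top_by_sum, top_by_median)
```

In this toy case the sum favors the topic of the single outlier text, while the median favors the topic that is moderately present in most texts.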
# Plot the stacked bar chart for the central authors (DataFrame.plot creates its own figure)
ax_refined_central = filtered_data_refined_central.plot(kind='bar', stacked=True, figsize=(20, 10), legend=False)
# Add labels within each bar for the central authors.
# Segments are stacked in column order, so the labels follow that order as well.
for i, author in enumerate(filtered_data_refined_central.index):
    cum_value = 0
    for topic, value in filtered_data_refined_central.loc[author].items():
        if value > 0:
            # Position the label in the center of the segment
            ax_refined_central.text(i, cum_value + value / 2, topic, ha='center', va='center', fontsize=8)
            cum_value += value
plt.title('Top 5 Aggregated Topic Distributions for Additional Central Authors')
plt.xlabel('Author')
plt.ylabel('Aggregated Topic Proportions')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Especially prominent here are Matthew Lewis's focus on Topic 51: Disillusionment with Society; Oscar Wilde's on Topic 52: Adventure, Splendor, Power and Challenges, History and Topic 39: Quest for Meaning - Self-Discovery, Transformation; John Keats's on Topic 34: Secrets and Suspense - Mystery, Devils and Assassinations and Topic 12: Home Invasion - Domestic Mystery and Conflict; Coleridge's on Topic 14: Conflict, Animosity, and Change - Emotional Changes, Death and Construction and Topic 69; Aikin's on Topic 36: Individualism vs. Conformity - Rebellion and Social Norms and Topic 52; and Gilman's on Topic 21. There is also a general heavy reliance on Topic 5: Excitability, Madness, and Deceit, with Walpole carrying the highest values and Shelley in second place, tied with Lee Sophia and Reeve Clara.
Generally speaking Topic 5: "Excitability, Madness and Deceit" is a prevalent theme across many authors, reinforcing the idea that Gothic literature frequently explores psychological instability and darker aspects of human behavior. Topic 51: "Disillusionment with Society" appears significant for several authors as well, suggesting themes of resistance against societal norms and the exploration of characters who are at odds with their social context.
Topic 70: "Myth and Splendor - Wealth and Castles" is prominent for authors like Charles Maturin, Arthur Machen, and Walpole, and indicates a focus on grandeur, historical settings, and perhaps a reflection on the role of the past in shaping individual identities and social structures. Oscar Wilde's most prevalent topics, 52: "Adventure, Splendor, Power and Challenges, History" and 39: "Quest for Meaning - Self-Discovery, Transformation", mirror these tendencies of a nostalgic fascination with the past and a drive for self-actualization.
Sleath Eleanor, Parsons Eliza, Lee Sophia, and Reeve Clara have a significant presence of Topic 65: "Atmospheric Battle Descriptions and Royalty", which could reflect their works that delve into grand conflicts and courtship.
John Keats, Algernon Blackwood, and Bram Stoker, in turn, have a considerable portion of their bars dedicated to Topic 5: "Excitability, Madness and Deceit" and Topic 12: "Home Invasion - Domestic Mystery and Conflict", suggesting a focus on personal turmoil and the encroachment of danger into personal spaces.
Recurring Topics Across Authors Median Values:
Topic 5: "Excitability, Madness, and Deceit - Aggression, conflict, and glee" is a prevailing theme among almost all authors, indicating that madness, deceit, and emotional extremes recur throughout the corpus. Topic 51: "Disillusionment with Society - Resistance, Protest, Retreat" is also frequently present, suggesting a common narrative thread in which characters grapple with societal norms and often feel a sense of disillusionment while engaged in uncanny and intimate struggles that rage close to home and yet have a faraway air to them. Further recurring topics are Topic 10: "Ferocity and Tragedy - animalistic traits, intimacy, conflict and science", Topic 12: "Home Invasion - Domestic Mystery and Conflict", and Topic 70: "Myth and Splendor - Wealth and Castles".
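How often a topic recurs across authors can be checked by counting its appearances in the authors' top-5 lists. The sketch below uses the median-based top-5 lists that this notebook reports for four authors; the variable names and the choice of these four authors are illustrative assumptions.

```python
from collections import Counter

# Median-based top-5 topics for four authors, as reported in this notebook.
median_top5 = {
    "Radcliffe, Ann": ["Topic 5", "Topic 38", "Topic 51", "Topic 70", "Topic 9"],
    "Shelley, Mary": ["Topic 5", "Topic 38", "Topic 51", "Topic 65", "Topic 9"],
    "Machen, Arthur": ["Topic 51", "Topic 12", "Topic 5", "Topic 70", "Topic 65"],
    "Stoker, Bram": ["Topic 70", "Topic 65", "Topic 51", "Topic 5", "Topic 12"],
}

# Count in how many top-5 lists each topic appears.
counts = Counter(topic for topics in median_top5.values() for topic in topics)

# Topics that appear in every author's list.
shared_by_all = sorted(t for t, n in counts.items() if n == len(median_top5))
print(counts.most_common())
print(shared_by_all)
```

For this subset, Topic 5 and Topic 51 are the only topics shared by all four authors, which matches the observation above.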
Marie Corelli and Nathaniel Hawthorne share a common interest in Topic 29: "Bickering, Fighting and Mountains", which might suggest a thematic focus on interpersonal conflict and possibly the rugged landscapes that are often a backdrop in Gothic tales.
Edgar Allan Poe is unique with Topic 28: "Communion in Nature - Transformation, Relationships and Identity", resonating with Poe's themes of personal transformation, identity, and often a deep connection with the natural world as a setting for his narratives.
Arthur Machen shows a distinct association with Topic 12: "Home Invasion - Domestic Mystery and Conflict", highlighting his interest in the invasion of the domestic sphere by supernatural or mysterious settings, especially befitting his many texts on supernatural boundary transgressions and invaders from other worlds.
Nathaniel Hawthorne: Distinct Theme: Topic 70: "Myth and Splendor - Wealth and Castles"
Hawthorne’s works often grapple with the moral legacy of Puritanism, and his focus on myths and castles may be seen as an allegory for the grand narratives and moral edifices of his own culture. This theme suggests a preoccupation with the past's weight on the present, reflecting a distinctly American perspective on history, morality, and identity.
Effect: Hawthorne's use of myth and grand settings creates a sense of historical depth and moral complexity, often questioning the possibility of redemption from past sins. His characters are frequently caught between the opulence of tradition and the necessity of moral integrity, exemplified in works like "The House of the Seven Gables."
Edgar Allan Poe: Distinct Theme: Topic 28: "Communion in Nature - Transformation, Relationships and Identity"
Poe's unique theme reflects his exploration of the individual's psyche and the transformative power of nature. He frequently uses natural settings as a mirror for or a catalyst for internal psychological states.
Effect: Poe’s narratives often lead to moments of epiphany or horror as his characters confront their own identities. Nature in Poe's works is not just a backdrop but an active participant in the narrative, influencing and reflecting the characters' mental and emotional journeys.
Arthur Machen: Distinct Theme: Topic 12: "Home Invasion - Domestic Mystery and Conflict"
Machen's focus on the invasion of the domestic sphere might hint at his interest in the vulnerability of personal space and the erosion of the boundaries between the safe and the profane.
Effect: This theme often leads to a deep-seated unease, as the sanctity of home is breached by otherworldly forces, making the familiar uncanny. Machen's work could be seen as prefiguring the modern psychological horror genre that frequently uses similar themes.
Marie Corelli: Distinct Theme: Topic 29: "Bickering, Fighting and Mountains"
Corelli’s narratives weave together interpersonal conflict with dramatic natural landscapes, perhaps reflecting the emotional turmoils and societal upheavals of her time.
Effect: The recurring theme of conflict against the backdrop of imposing nature may symbolize the characters' internal struggles and the larger societal conflicts. Mountains in her work might serve as a metaphor for obstacles to be overcome or as imposing witnesses to human folly.
Sheridan Le Fanu: Distinct Theme: Topic 5: "Excitability, Madness and Deceit - Aggression, conflict and glee"
Le Fanu’s Gothic tales often revolve around psychological ambiguity and unreliable narrations, with madness and deceit as central elements.
Effect: The focus on madness and deceit creates a pervasive sense of paranoia and questions the nature of reality itself. His stories such as "Carmilla" and "Uncle Silas" often feature characters whose grip on sanity is as tenuous as the reader's understanding of the true narrative.
Bram Stoker: Distinct Theme: Topic 61: "Vampires, Regality, Experiments, Festivities and Sacrifice"
Stoker, most famous for "Dracula," prominently features themes of vampirism, which intertwine regality and horror, bringing to the fore the anxieties of the fin-de-siècle era regarding degeneration and the breakdown of social norms.
Effect: Stoker’s work creates a contrast between the allure of the aristocratic vampire and the horror of its predatory nature. This theme often explores the fear of the foreign and the taboo, reflecting societal concerns about purity, invasion, and the breakdown of Victorian social structures.
Metaphysical and Philosophical Inquiry Group: Authors in this group explore themes of existence, the supernatural, and the search for meaning. Topics like Topic 39: "Quest for Meaning - Self-Discovery, Transformation" and Topic 66: "Hidden Knowledge, Learning and Secrets" are significant. Authors: Le Fanu, Shelley, Wilde, Coleridge, and Hogg.
Gothic Romanticism Group: This category includes authors whose works have a strong element of romance intertwined with the Gothic, often exploring the tension between desire and morality. Topics like Topic 28: "Communion in Nature - Transformation, Relationships and Identity", Topic 44: "Companionship in Times of Trial and Distress", and Topic 6: "Nature and Reasoning - Creativity, Understanding, mixed with Fauna" are indicative. Authors: Poe, Kipling, Le Fanu, Hawthorne.
Supernatural and Horror Group: Authors who frequently delve into the supernatural, horror, and the unknown belong here. They explore themes encapsulated by topics such as Topic 61: "Vampires, Regality, Experiments, Festivities, and Sacrifice". Authors: Stoker, Byron, Stevenson.
Social and Political Commentary Group: These authors use Gothic elements to critique social and political structures. Topics that stand out include Topic 36: "Individualism vs. Conformity - Rebellion and Social Norms" and Topic 51: "Disillusionment with Society - Resistance, Protest, Retreat". Authors: Hawthorne, Brown, Lytton, Gaskell, Chambers, Ainsworth, Machen, Scott, Lee Vernon, Smith Charlotte, Stoker, Shelley Mary, Radcliffe, Blackwood, Wharton, Le Fanu, Corelli. These authors share a method of expressing social discontent through Topic 51.
Historical and Mythic Reconstruction Group: Works by these authors are characterized by a strong sense of history and the interweaving of myth within their narratives. Prominent topics are Topic 54: "Medieval Cities, Castles and Courtship" and Topic 70: "Myth and Splendor - Wealth and Castles". Authors: Radcliffe, Hawthorne, Corelli, Le Fanu, Wharton, Blackwood, Stoker, Lee Vernon, Scott, Machen, Ainsworth, Gaskell.
Pioneers of the Psychological Thriller Group: This grouping is for authors who laid the groundwork for what would become the psychological thriller, focusing on the human mind's complexities and its vulnerabilities. Topics such as Topic 5: "Excitability, Madness and Deceit", Topic 38: "Psychology, Trauma and Secrets", and Topic 44: "Companionship in Times of Trial and Distress" are central. Authors: Le Fanu, Wharton, Blackwood, Radcliffe, Shelley, Stoker, Smith Charlotte, Bierce, Machen, Chambers.
Nature and the Sublime Group: Authors in this group integrate the natural world deeply into their Gothic narratives, often to evoke feelings of the sublime or to reflect the characters' inner turmoil. Indicative topics are Topic 6: "Nature and Reasoning - Creativity and Understanding, mixed with Nature" and Topic 28: "Communion in Nature - Transformation, Relationships and Identity". Authors: Poe, Shelley, Kipling, Chambers.
Conflict and Societal Restructure Group: These authors focus on the chaos and order of society, the collapse of old structures, and the struggle for new identities. Topics such as Topic 14: "Conflict, Animosity and Change", Topic 37: "Order and Chaos - Constrained Focus and Unchecked Emotions", and Topic 29: "Bickering, Fighting and Mountains" are highlighted. Authors: Bierce, Hawthorne, Marie Corelli, Radcliffe, Smith Charlotte.
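The notebook does not state how authors were assigned to the groups above. One hypothetical way to reproduce such an assignment is to test whether an author's top-5 topics overlap with the topics named for a group. In the sketch below, `GROUP_TOPICS` and `groups_for` are illustrative names introduced here, only three of the groups are included, and the top-5 lists are the ones this notebook reports for Machen (median) and Poe (sum).

```python
# Hypothetical mapping from thematic group to its indicative topics,
# copied from the group descriptions above (subset only).
GROUP_TOPICS = {
    "Social and Political Commentary": {"Topic 36", "Topic 51"},
    "Historical and Mythic Reconstruction": {"Topic 54", "Topic 70"},
    "Nature and the Sublime": {"Topic 6", "Topic 28"},
}


def groups_for(author_topics, group_topics=GROUP_TOPICS):
    """Return the groups whose indicative topics overlap with an author's top topics."""
    return sorted(
        group
        for group, topics in group_topics.items()
        if topics & set(author_topics)
    )


# Top-5 lists as reported in this notebook (median for Machen, sum for Poe).
machen_median = ["Topic 51", "Topic 12", "Topic 5", "Topic 70", "Topic 65"]
poe_sum = ["Topic 10", "Topic 28", "Topic 44", "Topic 4", "Topic 9"]

machen_groups = groups_for(machen_median)
poe_groups = groups_for(poe_sum)
print(machen_groups)
print(poe_groups)
```

A single shared topic is a deliberately loose criterion; requiring two or more shared topics, or a minimum topic weight, would yield smaller and more distinct groups.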
The most prevalent topics among these authors throughout time¶
To get a better idea of who published at what time, and thus influenced the topic distribution, the distribution of publications per author is plotted first.
top_authors = df['author'].value_counts().head(20).index.tolist()
central_authors = df[df['role'] == 'Central']['author'].unique().tolist()
refined_central_authors = list(set(central_authors + top_authors))
publication_dates = df[df['author'].isin(refined_central_authors)].groupby(['date', 'author']).size().unstack(fill_value=0)
# Generate the plot
fig, ax = plt.subplots(figsize=(20, 10))
# Plot the data
publication_dates.plot(kind='bar', stacked=True, colormap='nipy_spectral', edgecolor='none', ax=ax)
# Iterate over each stack (author) in the bar chart
for i, author in enumerate(publication_dates.columns):
    bars = ax.containers[i]
    # Label only bars with height > 0, using the first three letters of the author's name
    labels = [author[:3].upper() if bar.get_height() > 0 else '' for bar in bars]
    # Place the labels at the center of each bar
    ax.bar_label(bars, labels=labels, label_type='center', fontsize=7)
ax.legend(title='Authors', bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=8)
ax.spines['bottom'].set_visible(True)
ax.tick_params(bottom=True, labelbottom=True)
ax.set_title('Publication Date Distribution of Central Authors')
ax.set_ylabel('Number of Publications')
# Avoiding clipping elements on the x axis
years = publication_dates.index
ax.set_xticks(range(0, len(years), 1))
ax.set_xticklabels([years[i] for i in range(0, len(years), 1)], rotation=45)
plt.tight_layout()
plt.show()
combined_topics = pd.concat([aggregated_topics_top_authors, aggregated_topics_refined_central])
# Determining the 10 most prevalent topics across the combined set
top_10_topics = combined_topics.sum().nlargest(10).index.tolist()
# Time series
# Aggregate the occurrences of each of the 10 topics by year
time_series_data = df[df['author'].isin(top_authors + refined_central_authors)]
# Create a DataFrame to store the yearly aggregated values for each topic
yearly_topic_aggregation = pd.DataFrame(index=time_series_data['date'].unique(), columns=top_10_topics)
# Aggregate the topics by year
for topic in top_10_topics:
    yearly_data = time_series_data.groupby('date')[topic].sum()
    yearly_topic_aggregation[topic] = yearly_data
# Sort the index to ensure it is in chronological order
yearly_topic_aggregation.sort_index(inplace=True)
# Setting up the grid for facet wrap
plt.figure(figsize=(20, 20))
gs = gridspec.GridSpec(5, 2) # 5 rows, 2 columns
# Create an individual plot for each of the top 10 topics
for i, topic in enumerate(top_10_topics):
    ax = plt.subplot(gs[i])
    # Retrieve the label from topic_labels, or use the topic name if not found
    label = topic_labels.get(topic, topic)
    ax.plot(yearly_topic_aggregation.index, yearly_topic_aggregation[topic])
    # Include the label and the topic in the title
    ax.set_title(f"{label} ({topic})")
    ax.set_xlabel('Year')
    ax.set_ylabel('Aggregated Occurrence')
    ax.grid(True)
plt.tight_layout()
plt.show()
Topics 3, 45, 34, 12, 65, and 70 show a sharp, out-of-the-ordinary spike around 1837, which is due to the large sway that Hawthorne holds over the corpus in this particular timeframe. Although most of his texts do not partake heavily in Topic 34, "Vision of the Fountain" consists of 98% of this topic, befitting a text focused on unraveling the message conveyed by a dream state. The jagged but strong shift in the 1870s is caused by Le Fanu, whose main contributing topics 60, 12, 51, 70, and 65 are heavily affected, showing how influential his voice is for the most prevalent topics of the corpus. Topics 5, 51, 12, 34, 38, and 45 show yet another spike around 1898 due to Corelli and Machen. While Machen, like Le Fanu, has a very classical profile fitting the trend, Corelli is highly unique in her distribution of topics, dealing with fighting, strife, and exploration.
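Because the yearly values are sums, a year with many texts produces a spike even when the topic is no more intense per text. One way to check this, offered here as a sketch and not as part of the original analysis, is to compare the yearly sum with the yearly mean and the number of texts. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): four texts in 1837, one text in each neighboring year.
# Every text carries roughly the same weight for the topic.
toy = pd.DataFrame(
    {
        "date": [1836, 1837, 1837, 1837, 1837, 1838],
        "Topic 70": [0.20, 0.20, 0.25, 0.15, 0.20, 0.20],
    }
)

# Yearly sum, yearly mean, and number of texts side by side.
yearly = toy.groupby("date")["Topic 70"].agg(total="sum", mean="mean", n_texts="count")
print(yearly)
```

In this toy case the summed series peaks in 1837 solely because of the number of texts, while the mean stays flat. A spike that survives in the mean would instead point to a change in topic intensity.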
Contribution of authors and individual works to the most distinct and most important topics, according to the previous results and the multidimensional scaling of pyLDAvis
author_topics_comparison = {}
for author in top_authors:
    author_sum_topics = df[df['author'] == author][topic_columns].sum().nlargest(5).index.tolist()
    author_median_topics = df[df['author'] == author][topic_columns].median().nlargest(5).index.tolist()
    # Store both lists in a dictionary for the author
    author_topics_comparison[author] = {
        'Sum_Topics': author_sum_topics,
        'Median_Topics': author_median_topics
    }
author_topics_comparison
{'Hawthorne, Nathaniel': {'Sum_Topics': ['Topic 70',
'Topic 3',
'Topic 65',
'Topic 56',
'Topic 12'],
'Median_Topics': ['Topic 70',
'Topic 52',
'Topic 35',
'Topic 3',
'Topic 29']},
'Corelli, Marie': {'Sum_Topics': ['Topic 51',
'Topic 70',
'Topic 29',
'Topic 42',
'Topic 31'],
'Median_Topics': ['Topic 51',
'Topic 29',
'Topic 42',
'Topic 70',
'Topic 31']},
'Le Fanu, Sheridan': {'Sum_Topics': ['Topic 70',
'Topic 5',
'Topic 51',
'Topic 60',
'Topic 65'],
'Median_Topics': ['Topic 70',
'Topic 51',
'Topic 5',
'Topic 12',
'Topic 39']},
'Poe, Edgar Allan': {'Sum_Topics': ['Topic 10',
'Topic 28',
'Topic 44',
'Topic 4',
'Topic 9'],
'Median_Topics': ['Topic 4', 'Topic 70', 'Topic 10', 'Topic 9', 'Topic 7']},
'Wharton, Edith': {'Sum_Topics': ['Topic 9',
'Topic 65',
'Topic 5',
'Topic 51',
'Topic 34'],
'Median_Topics': ['Topic 65',
'Topic 5',
'Topic 70',
'Topic 51',
'Topic 34']},
'Blackwood, Algernon': {'Sum_Topics': ['Topic 51',
'Topic 5',
'Topic 18',
'Topic 9',
'Topic 70'],
'Median_Topics': ['Topic 51',
'Topic 5',
'Topic 18',
'Topic 12',
'Topic 70']},
'Radcliffe, Ann': {'Sum_Topics': ['Topic 5',
'Topic 14',
'Topic 51',
'Topic 38',
'Topic 67'],
'Median_Topics': ['Topic 5', 'Topic 38', 'Topic 51', 'Topic 70', 'Topic 9']},
'Shelley, Mary': {'Sum_Topics': ['Topic 5',
'Topic 51',
'Topic 38',
'Topic 65',
'Topic 66'],
'Median_Topics': ['Topic 5', 'Topic 38', 'Topic 51', 'Topic 65', 'Topic 9']},
'Stoker, Bram': {'Sum_Topics': ['Topic 12',
'Topic 70',
'Topic 65',
'Topic 61',
'Topic 5'],
'Median_Topics': ['Topic 70',
'Topic 65',
'Topic 51',
'Topic 5',
'Topic 12']},
'Smith, Charlotte': {'Sum_Topics': ['Topic 3',
'Topic 65',
'Topic 38',
'Topic 5',
'Topic 51'],
'Median_Topics': ['Topic 38',
'Topic 51',
'Topic 5',
'Topic 37',
'Topic 34']},
'Lee, Vernon': {'Sum_Topics': ['Topic 22',
'Topic 70',
'Topic 60',
'Topic 51',
'Topic 65'],
'Median_Topics': ['Topic 51',
'Topic 52',
'Topic 7',
'Topic 12',
'Topic 43']},
'Bierce, Ambrose': {'Sum_Topics': ['Topic 49',
'Topic 10',
'Topic 39',
'Topic 32',
'Topic 69'],
'Median_Topics': ['Topic 10',
'Topic 69',
'Topic 29',
'Topic 12',
'Topic 5']},
'Scott, Walter': {'Sum_Topics': ['Topic 54',
'Topic 5',
'Topic 65',
'Topic 51',
'Topic 12'],
'Median_Topics': ['Topic 5',
'Topic 65',
'Topic 51',
'Topic 12',
'Topic 60']},
'Kipling, Rudyard': {'Sum_Topics': ['Topic 65',
'Topic 31',
'Topic 8',
'Topic 18',
'Topic 28'],
'Median_Topics': ['Topic 65', 'Topic 5', 'Topic 8', 'Topic 18', 'Topic 45']},
'Machen, Arthur': {'Sum_Topics': ['Topic 51',
'Topic 12',
'Topic 5',
'Topic 65',
'Topic 70'],
'Median_Topics': ['Topic 51',
'Topic 12',
'Topic 5',
'Topic 70',
'Topic 65']},
'Ainsworth, William Harrison': {'Sum_Topics': ['Topic 35',
'Topic 34',
'Topic 5',
'Topic 70',
'Topic 37'],
'Median_Topics': ['Topic 5',
'Topic 51',
'Topic 42',
'Topic 35',
'Topic 38']},
'Chambers, Robert William': {'Sum_Topics': ['Topic 5',
'Topic 46',
'Topic 12',
'Topic 51',
'Topic 10'],
'Median_Topics': ['Topic 5', 'Topic 51', 'Topic 10', 'Topic 46', 'Topic 6']},
'Gaskell, Elizabeth': {'Sum_Topics': ['Topic 5',
'Topic 51',
'Topic 73',
'Topic 52',
'Topic 70'],
'Median_Topics': ['Topic 5',
'Topic 51',
'Topic 70',
'Topic 12',
'Topic 52']},
'Lytton, Edward Bulwer Lyt': {'Sum_Topics': ['Topic 51',
'Topic 38',
'Topic 5',
'Topic 12',
'Topic 66'],
'Median_Topics': ['Topic 51',
'Topic 5',
'Topic 38',
'Topic 12',
'Topic 65']},
'Brown, Charles Brockden': {'Sum_Topics': ['Topic 51',
'Topic 38',
'Topic 5',
'Topic 65',
'Topic 10'],
'Median_Topics': ['Topic 5',
'Topic 51',
'Topic 38',
'Topic 10',
'Topic 65']}}
Excitability, Madness, and Deceit (Topic 5) Influence: 10 or 38 There are noticeable spikes throughout the timeline, with significant peaks around 1800 and 1870. The former encompasses the activities of Radcliffe, Shelley, and Lewis as some of the founders of the genre, while the latter is due to Stoker, Le Fanu, and Poe. (+ Related Topics: Wharton, Blackwood, Radcliffe, Mary Shelley, Smith Charlotte, Bierce, Scott Walter, Machen, Ainsworth, Lytton, Brown)
Myth and Splendor - Wealth and Castles (Topic 70) Influence: 4, 7, 65, 1 There's a particularly high peak around the late 1700s, which could correlate with the Romantic movement's interest in the past and the supernatural as seen in the works of authors like Ann Radcliffe and Hawthorne. The decline post-1800 might indicate a shift toward more realistic or psychological narratives. Others like Corelli, Machen, Ainsworth, and Stoker picked up the theme again later. (+ Related Topics: Poe, Le Fanu, Wharton, Smith Charlotte, Lee Vernon, Lytton, Brown Charles Brockden)
Disillusionment with Society (Topic 51) Influence: 19 The topic peaks sharply in the early 1800s and again in the early 1900s, possibly reflecting periods of social upheaval and reform, which might be explored in the works of authors such as Radcliffe, Hawthorne, Le Fanu, and Shelley. But also: Corelli, Wharton, Stoker, Scott, Ainsworth, Gaskell
Atmospheric Battle Descriptions and Royalty (Topic 65) Influence: 1, 5, 70 This topic shows a pronounced peak in the early 1800s, aligning with the Napoleonic Wars, which might have influenced Gothic literature's thematic content, as seen in the writings of the era that deal with grand historical events and their aftermath. Relevant for: Le Fanu, Wharton, Smith Charlotte, Lee Vernon, Lytton, Brown Charles Brockden, Stoker (Related Topics: Corelli, Machen, Ainsworth, Le Fanu, Poe)
Home Invasion - Domestic Mystery and Conflict (Topic 12) Influence: 4, 65, 34 The peaks in the early 1800s and early 1900s could reflect societal anxieties about the sanctity of the home and the individual's security during times of social change, a theme evident in the works of Stoker and Le Fanu. (Related Topics: Wharton, Ainsworth)
Ferocity and Tragedy (Topic 10) Influence: 65, 5, 45 The graph shows peaks in the late 1700s and then again in the mid-1800s, which might correspond to periods where themes of primal instincts and the questioning of humanity became prominent, perhaps in response to the Enlightenment and later, the Industrial Revolution. Relevant for Chambers and Brown (Related Topics: Poe, Bierce)
Secrets and Suspense - Mystery, Devils and Assassinations (Topic 34) Influence: 38, 12, 11, 4 There's a notable peak around the 1790s, potentially reflecting the influence of the French Revolution and the rise of Romanticism, with its emphasis on emotion and individual experience, as seen in the works of authors like Radcliffe and Lewis.
Psychology, Trauma, and Secrets (Topic 38) Influence: 10, 17, 15 A steady increase into the 19th century reflects the growing interest in human psychology and the exploration of trauma, possibly influenced by the psychological theories emerging at the time and explored in Gothic fiction by authors like Poe.
Status and Individuality - Striving, Misery and Plentifulness - Excess (Topic 3) Influence: 6, 5, 70 The peak in the late 1700s may be associated with the social upheavals of the time, such as the American and French revolutions, which challenged existing hierarchies and social structures, themes explored in the literature of authors like Hawthorne. (Related Topics: Chambers)
Intimacy, Emotions, and Identity (Topic 45) Influence: 10, 3, 2 The graph shows a steady presence with a few peaks, particularly in the mid-1800s, which could correspond to a focus on personal relationships and the inner self, possibly explored by authors like Charlotte Brontë or Kipling. (Related Topics: Poe, Bierce)
Thematic Grouping and Inter-relationships¶
1 - 11, 17, 70 - Atmosphere, vast, archaic, refined
2 - 5, 10, 45 - Emotions, Arousal, Fear, Secrecy
3 - 6, 5, 70 - Individualism, Status, Excess
4 - 17, 70, 34 - Myth and Crime
5 - 10, 38 - Aggression and Emotion
6 - 8, 20 - Nature & Reasoning
7 - 2, 19 - Socializing, Courtship
8 - 9, 13 - Faith, Knighthood and Knowledge
9 - 8, 16, 65 - Conviction and Adventure
10 - 65, 5, 45 - Intimacy and Conflict, Tragedy
11 - 17, 1, 34 - Doom & Gloom
12 - 4, 65, 34 - Home Invasion
13 - 4, 16, 19 - Rituals, Dance, Magic
14 - 5, 65, 17 - Conflict, Death
15 - 7, 5 - Trickery and Science
16 - 9, 13 - Desecrated Chapel
17 - 4, 11, 14 - Undead, judgment and grief
18 - 4, 17 - Mystery and Adversity
19 - 10, 51, 13 - Forlorn Carnival
20 - 6, 8 - Science and Nature
34 - 38, 12, 11, 4 - Secrets, mystery, Suspense
38 - 10, 17, 15 - Psychology, Trauma, Secrets
45 - 10, 3, 2 - Intimacy, Emotions, Identity
51 - 19 - Disillusionment with Society
65 - 1, 5, 70 - Battle, Atmosphere, Royalty
70 - 4, 7, 65, 1 - Myth, Wealth, Castles
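The inter-relationships listed above can be explored as a graph. The sketch below reads each line as "topic: related topics", which is an assumption about the list's meaning, and uses only five of its entries. The names `related` and `G` are introduced here for illustration.

```python
import networkx as nx

# Subset of the list above, read as "topic: related topics" (an assumption).
related = {
    12: [4, 65, 34],
    34: [38, 12, 11, 4],
    51: [19],
    65: [1, 5, 70],
    70: [4, 7, 65, 1],
}

# Build an undirected graph in which an edge means "listed as related".
G = nx.Graph()
for topic, neighbours in related.items():
    G.add_edges_from((topic, other) for other in neighbours)

# Number of distinct related topics per node.
degrees = dict(G.degree())
print(sorted(degrees.items(), key=lambda kv: -kv[1])[:3])

# Topics 12 and 70 are connected through shared neighbors such as 4 and 65.
print(nx.has_path(G, 12, 70))
```

In this subset Topic 51 is linked only to Topic 19 and therefore forms a component of its own, separate from the cluster around Topics 12, 34, 65, and 70.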
Comparison of the influence of all authors on these topics¶
relevant_topics = [f"Topic {i}" for i in range(1, 21)] + ["Topic 70", "Topic 65", "Topic 51", "Topic 45", "Topic 38", "Topic 34"]
relevant_topics = [topic for topic in relevant_topics if topic in df.columns]
# Aggregate data: Calculate the sum of contributions for each author in each topic
author_topic_contribution = df.groupby('author')[relevant_topics].sum()
# For each topic, find the top 5 contributing authors
top_authors_per_topic = {topic: author_topic_contribution[topic].nlargest(5) for topic in relevant_topics}
# Adjust the figure and subplots to accommodate all 26 topics (using a 7x4 grid)
fig, axes = plt.subplots(7, 4, figsize=(46, 85))
# Flatten the array of axes for easy iteration
axes = axes.flatten()
for i, (topic, authors) in enumerate(top_authors_per_topic.items()):
    # Check to ensure we don't go out of bounds
    if i < len(axes):
        sns.barplot(ax=axes[i], x=authors.values, y=authors.index, palette="Blues_d", hue=authors.index, legend=False)
        label = topic_labels.get(topic, topic)
        # Include the label and the topic in the title
        axes[i].set_title(f"{label}\n({topic})")
        axes[i].set_xlabel('Contribution')
        axes[i].set_ylabel('Author')
# Hide any unused subplots
for j in range(i+1, len(axes)):
    axes[j].set_visible(False)
# Increase spacing between plots
fig.subplots_adjust(hspace=0.6, wspace=1.0)
plt.show()
Strongest Topic Associations¶
Henry James -> Archaic Atmosphere (1)
Wharton -> Gloom and Longing, Blasphemy, Battles & Nature (11, 16, 65, 6)
Walter Scott -> Gloom and Longing (11)
Corelli Marie -> Emotions, Status, Convictions, Institutions, medieval, Mystery, Dances, Social Discontent (2, 3, 8, 15, 18, 19, 51)
Radcliffe -> Emotions, Conflict, Madness, Social Discontent (2, 5, 14, 51)
Poe -> Gossip, Gloom, Undead, Mystery, Animals (4, 7, 9, 10, 17)
Bierce Ambrose -> Ferocity & Tragedy (10)
Le Fanu -> Madness & Romanticism, Longing, repulsive Intimacy, archaic atmosphere (5, 6, 11, 19, 70, 45)
Blackwood -> Madness, Adventure, Conviction, Dreams and Mystery, Societal Discontent, Identity (5, 9, 18, 51, 45)
Wilde -> Conviction & Death (8)
Keats -> Home Invasion & Mystery and Conflict (12)
Lee Vernon -> Social Pleasantries and Scheming (7)
Stoker -> Home Invasion, Desecration, Dreams & Mystery, Castles & Myth (16, 12, 18, 70)
Hawthorne -> Home Invasion, Witchcraft, Status & Individuality, Deceit & Institutions, Mystery, Merriment (3, 12, 13, 15, 16, 18, 19, 70, 65)
Rymer, James -> Rituals (13)
Coleridge -> Conflict, Emotions (14)
Machen -> Undead (17)
La Spina Grey -> Festivities, Intimacy, and Disgust (19)
Kipling -> Battles & Royalty (65)
Byron -> Intimacy & Identity (45)
Smith Charlotte -> Psychological Trauma (38)
Shelley Mary -> Psychological Trauma, Madness & Aggression, Trickery and Science, Science & Animalistic Violence (15,10, 38, 5)
Coleridge -> Secrets and Demons (34)
Brown Charles Brockden -> Psychological Trauma (38)
The top texts per topic among the most prevalent topics¶
# Aggregate data: calculate the sum of contributions for each text (title and author) in each topic
title_topic_contribution = df.groupby(['title', 'author'])[relevant_topics].sum()
# For each topic, find the top 5 contributing texts
top_titles_per_topic = {topic: title_topic_contribution[topic].nlargest(5) for topic in relevant_topics}
fig, axes = plt.subplots(7, 4, figsize=(46, 95))
# Flatten the array of axes for easy iteration
axes = axes.flatten()
# Setting font sizes for readability
title_fontsize = 14
label_fontsize = 12
tick_fontsize = 10
for i, (topic, titles) in enumerate(top_titles_per_topic.items()):
    # Check to ensure we don't go out of bounds
    if i < len(axes):
        # Build one plain-string label per (title, author) pair for the y axis and the hue
        text_labels = [f"{title}\nby {author}" for title, author in titles.index]
        sns.barplot(ax=axes[i], x=titles.values, y=text_labels, palette="Blues_d", hue=text_labels, legend=False)
        label = topic_labels.get(topic, topic)
        axes[i].set_title(f"{label}\n({topic})", fontsize=title_fontsize)
        axes[i].set_xlabel('Contribution', fontsize=label_fontsize)
        axes[i].set_ylabel('Texts', fontsize=label_fontsize)
        axes[i].tick_params(labelsize=tick_fontsize)
# Hide any unused subplots
for j in range(i+1, len(axes)):
    axes[j].set_visible(False)
# Adjusting the layout with better spacing
plt.subplots_adjust(hspace=0.6, wspace=1.0)
plt.show()
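The aggregation behind these charts can be illustrated on a small, made-up frame. The sketch below assumes hypothetical titles, authors, and two topic columns, with several rows per text; it is not the corpus data, only a demonstration of how `groupby(...).sum()` followed by `nlargest` yields the strongest texts per topic.

```python
import pandas as pd

# Hypothetical toy data: several rows per text, two topic columns
toy = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "C"],
    "author": ["X", "X", "Y", "Y", "Z"],
    "Topic 1": [10.0, 5.0, 2.0, 1.0, 20.0],
    "Topic 2": [0.0, 1.0, 8.0, 9.0, 3.0],
})
toy_topics = ["Topic 1", "Topic 2"]

# Sum each text's contribution per topic; the index becomes (title, author)
contribution = toy.groupby(["title", "author"])[toy_topics].sum()

# Keep the two strongest texts for every topic
top_per_topic = {t: contribution[t].nlargest(2) for t in toy_topics}

for topic, series in top_per_topic.items():
    print(topic, list(series.index))
```

Grouping by both title and author keeps the author available for the axis labels and also separates texts that happen to share a title.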
Deviations and additional observations on a comparison of influence on the level of the text
Wharton -> Frightful Dialogue, Myth and Trials, Chivalry & Faith, Rituals & Magic (2, 4, 8, 13)
Walpole -> Aggression & Madness (5)
Kipling -> Chivalry & Faith (8)
Shelley Mary -> Gloom, Doom, and Longing (11)
Lee Vernon -> Undeath & Grief (17)
Hawthorne -> Myth, Wealth and Castles (70): three of the top entries are by Hawthorne, making him the prime contributor
Moore Thomas -> Intimacy, Identity, and Emotions (45)
Parsons Eliza -> Psychological Trauma (38)
Lytton -> Psychological Trauma (38)
Gender-Based Analysis
To compare the genders, a distinctiveness score is calculated for each topic: a gender's share of the topic's total contribution, minus the combined share of all other genders. A higher score indicates greater distinctiveness, i.e. the topic is strongly associated with one gender while the contribution of the other is small.
A negative score for female authors therefore indicates that the topic is less distinctly associated with female authors than with male authors. The approach highlights the topics where one gender's contribution is relatively more significant than the other's.
Contribution among the leading topics:
# Calculate the total contribution for each topic
total_contributions = df[relevant_topics].sum()
# Calculate specific contributions for genders
specific_contributions = df.groupby('gender')[relevant_topics].sum()
# Calculate distinctiveness score: specific contribution divided by total contribution
distinctiveness_scores = specific_contributions.div(total_contributions)
# Subtracting the sum of contributions of other genders to enhance distinctiveness
for gender in distinctiveness_scores.index:
    other_genders = distinctiveness_scores.index.difference([gender])
    distinctiveness_scores.loc[gender] -= distinctiveness_scores.loc[other_genders].sum()
# Identifying top 5 distinct topics for each gender
top_distinct_topics = {gender: distinctiveness_scores.loc[gender].nlargest(5) for gender in distinctiveness_scores.index}
# Extracting the top 5 distinct topics for genders
top_f_topics = top_distinct_topics['f']
top_m_topics = top_distinct_topics['m']
# Creating bar charts
fig, axes = plt.subplots(2, 1, figsize=(15, 20))
# Mapping topic names to labels for female authors
f_topic_labels = [f"{topic_labels.get(topic, topic)}\n({topic})" for topic in top_f_topics.index]
top_f_topics.plot(kind='bar', ax=axes[0], color='#ff9999')
axes[0].set_title("Top 5 Distinct Topics for female authors")
axes[0].set_ylabel("Distinctiveness Score")
axes[0].set_xlabel("Topics")
axes[0].set_xticklabels(f_topic_labels, rotation=45, ha='right') # Set custom x-tick labels
# Mapping topic names to labels for male authors
m_topic_labels = [f"{topic_labels.get(topic, topic)}\n({topic})" for topic in top_m_topics.index]
top_m_topics.plot(kind='bar', ax=axes[1], color='#66b3ff')
axes[1].set_title("Top 5 Distinct Topics for male authors")
axes[1].set_ylabel("Distinctiveness Score")
axes[1].set_xlabel("Topics")
axes[1].set_xticklabels(m_topic_labels, rotation=45, ha='right') # Set custom x-tick labels
plt.tight_layout()
plt.show()
Notably, the scores for female authors are negative, indicating that even these topics are less distinctly associated with female authorship than with male authorship. The magnitude of the negative value represents the degree of this lesser association.
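The score can be made concrete with a small worked example. The sketch below is an illustration on made-up numbers, not the notebook's own implementation: the helper `distinctiveness` and the toy frame are hypothetical. It computes each group's share of a topic's total and subtracts the combined share of all other groups, always working from the unmodified shares. This matters because the loop in the cell above updates the score rows in place, so a group processed later is computed from rows that have already been adjusted; the vectorised form avoids that order dependence.

```python
import pandas as pd

def distinctiveness(frame, group_col, topic_cols):
    """Own share of each topic's total minus the combined share of all other groups."""
    totals = frame[topic_cols].sum()
    shares = frame.groupby(group_col)[topic_cols].sum().div(totals)
    # own share - (sum of all shares - own share), using the untouched shares
    return 2 * shares - shares.sum(axis=0)

# Hypothetical toy data: two groups, two topics
toy = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "m"],
    "Topic 1": [30.0, 10.0, 20.0, 20.0, 20.0],
    "Topic 2": [5.0, 5.0, 40.0, 30.0, 20.0],
})
scores = distinctiveness(toy, "gender", ["Topic 1", "Topic 2"])
print(scores.round(2))
```

With only two groups the scores mirror each other, and a group holding less than half of a topic's total receives a negative score. Because contributions are summed rather than averaged, a group with fewer texts will tend to hold the smaller share, which should be kept in mind when reading negative values. The same helper could be applied to any grouping column, for example nationality.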
Contribution among all topics
all_topics = [col for col in df.columns if col.startswith('Topic ')]
total_contributions_all_topics = df[all_topics].sum()
# Calculate specific contributions for each gender for all topics
specific_contributions_all_topics = df.groupby('gender')[all_topics].sum()
# Calculate distinctiveness score for all topics
distinctiveness_scores_all_topics = specific_contributions_all_topics.div(total_contributions_all_topics)
# Subtracting the sum of contributions of other gender to enhance distinctiveness
for gender in distinctiveness_scores_all_topics.index:
    other_genders = distinctiveness_scores_all_topics.index.difference([gender])
    distinctiveness_scores_all_topics.loc[gender] -= distinctiveness_scores_all_topics.loc[other_genders].sum()
top_distinct_topics_all_topics = {gender: distinctiveness_scores_all_topics.loc[gender].nlargest(5) for gender in distinctiveness_scores_all_topics.index}
top_f_topics_all = top_distinct_topics_all_topics['f']
top_m_topics_all = top_distinct_topics_all_topics['m']
fig, axes = plt.subplots(2, 1, figsize=(14, 16))
# Generate the labels for female authors with topic numbers on a new line
f_labels = [f"{topic_labels.get(topic, 'No Label')}\n({topic.split(' ')[-1]})" for topic in top_f_topics_all.index]
top_f_topics_all.plot(kind='bar', ax=axes[0], color='#ff9999')
axes[0].set_title("Top 5 Distinct Topics for 'f' Gender")
axes[0].set_ylabel("Distinctiveness Score")
axes[0].set_xlabel("Topics")
axes[0].set_xticklabels(f_labels, rotation=45)
# Generate the labels for male authors with topic numbers on a new line
m_labels = [f"{topic_labels.get(topic, 'No Label')}\n({topic.split(' ')[-1]})" for topic in top_m_topics_all.index]
top_m_topics_all.plot(kind='bar', ax=axes[1], color='#66b3ff')
axes[1].set_title("Top 5 Distinct Topics for 'm' Gender")
axes[1].set_ylabel("Distinctiveness Score")
axes[1].set_xlabel("Topics")
axes[1].set_xticklabels(m_labels, rotation=45)
plt.tight_layout()
plt.show()
# Selecting the top distinct topics for each gender from the full topic range
top_f_topics_list = top_f_topics_all.index.tolist()
top_m_topics_list = top_m_topics_all.index.tolist()
# Function to find the texts with the highest contribution to a given topic
def find_representative_texts(topic, num_texts=3):
    return df.sort_values(by=topic, ascending=False)[['title', 'author', topic]].head(num_texts)
# Finding representative texts for each of the top topics
representative_texts_f = {topic: find_representative_texts(topic) for topic in top_f_topics_list}
representative_texts_m = {topic: find_representative_texts(topic) for topic in top_m_topics_list}
condensed_representative_texts = {
    "f_gender": {topic: texts[['title', 'author']].to_dict(orient='records') for topic, texts in representative_texts_f.items()},
    "m_gender": {topic: texts[['title', 'author']].to_dict(orient='records') for topic, texts in representative_texts_m.items()}
}
condensed_representative_texts
{'f_gender': {'Topic 22': [{'title': 'Hauntings', 'author': 'Lee, Vernon'},
{'title': 'Hauntings', 'author': 'Lee, Vernon'},
{'title': 'Arthur Mervyn; Or, Memoirs Of The Year 1793',
'author': 'Brown, Charles Brockden'}],
'Topic 67': [{'title': 'Superstition: An Ode', 'author': 'Radcliffe, Ann'},
{'title': 'The Yellow Wallpaper', 'author': 'Gilman, Charlotte Perkins'},
{'title': "The Damned Thing\n1898, From 'In the Midst of Life'",
'author': 'Bierce, Ambrose'}],
'Topic 38': [{'title': 'The Banished Man', 'author': 'Smith, Charlotte'},
{'title': 'The Castle Of Wolfenbach', 'author': 'Parsons, Eliza'},
{'title': 'The Emigrants', 'author': 'Smith, Charlotte'}],
'Topic 21': [{'title': 'Villette', 'author': 'Brontë, Charlotte'},
{'title': 'The Yellow Wallpaper', 'author': 'Gilman, Charlotte Perkins'},
{'title': 'The Grey Woman', 'author': 'Gaskell, Elizabeth'}],
'Topic 72': [{'title': 'A Beleaguered City, Being A Narrative Of Certain Recent Events In The City Of Semur, In The Department Of The Haute Bourgogne. A Story Of The Seen And The Unseen:',
'author': 'Oliphant, Margaret'},
{'title': 'The Death Of Halpin Frayser', 'author': 'Bierce, Ambrose'},
{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel'}]},
'm_gender': {'Topic 44': [{'title': 'The Tell-Tale Heart',
'author': 'Poe, Edgar Allan'},
{'title': 'In Search of the Unknown', 'author': 'Chambers, Robert William'},
{'title': 'The Narrative Of Arthur Gordon Pym Of Nantucket',
'author': 'Poe, Edgar Allan'}],
'Topic 50': [{'title': 'Alonzo The Brave And Fair Imogine',
'author': 'Lewis, Matthew'},
{'title': "The Monkey'S Paw", 'author': 'Jacobs, William Wymark'},
{'title': "The Monkey's Paw\nThe Lady of the Barge and Others, Part 2.",
'author': 'Jacobs, William Wymark'}],
'Topic 54': [{'title': 'Woodstock; or, the Cavalier',
'author': 'Scott, Walter'},
{'title': "The Damned Thing\n1898, From 'In the Midst of Life'",
'author': 'Bierce, Ambrose'},
{'title': 'Varney The Vampire', 'author': 'Rymer, James Malcolm'}],
'Topic 28': [{'title': 'The Oval Portrait', 'author': 'Poe, Edgar Allan'},
{'title': 'The Phantom Rickshaw, and Other Ghost Stories',
'author': 'Kipling, Rudyard'},
{'title': 'In Search of the Unknown',
'author': 'Chambers, Robert William'}],
'Topic 49': [{'title': 'The Vampyre', 'author': 'Stagg, John'},
{'title': 'An Occurrence at Owl Creek Bridge', 'author': 'Bierce, Ambrose'},
{'title': "The Monkey's Paw\nThe Lady of the Barge and Others, Part 2.",
'author': 'Jacobs, William Wymark'}]}}
It is difficult to pass judgment on these topics, let alone to identify readily gender-coded ones, given that the two groupings mirror one another in their general content. Both share a topic related to some form of entertainment, some associations with traveling, some mythical and fantastical associations, and some associations with distress. Both also have topics associated with romance and emotions. Only a mild difference can be put forward: the strongly male topics covering emotions are more closely associated with trials, honor, and courtship, and thus suggest a more formal and restrained type of interaction. "Companionship in Times of Trial and Distress" encompasses terms like "brood, firmness, accommodate, acceptance, and conducted equilibrium", although these terms are also noisier. Meanwhile, the topic "Emotional Dynamics and Interactions", with words like "breathless, hug, vociferating, moan, ruffled, brazen", strikes a more immediate, unmediated, and passionate note.
While none of these gender-coded topics are among the most defining topics for the whole corpus, "38 - Psychology, Trauma, and Secrets" is prevalent enough to rank among the 20 most influential ones, showing up as a defining element for Mary Shelley and Charlotte Smith, but also for Charles Brockden Brown, Eliza Parsons, and Edward Bulwer-Lytton. Among the most influential texts for this topic are also texts by Ann Radcliffe, Marie Corelli, Sophia Lee, and many other female authors within the corpus.
The same holds for "28 - Communion in Nature - Transformation, Relationships, and Identity", which can be considered the topic of the Romanticists and Decadent writers, with texts by Poe, Byron, Wilde, and Hawthorne contributing most strongly to it.
Nationality-Based Analysis
Just as for gender, a distinctiveness score is calculated for each topic and each nationality, subtracting the combined contribution of all other nationalities.
total_contributions_nationality = df[all_topics].sum()
# Calculate specific contributions for each nationality for all topics
specific_contributions_nationality = df.groupby('nationality')[all_topics].sum()
# Calculate distinctiveness score for all topics for each nationality
distinctiveness_scores_nationality = specific_contributions_nationality.div(total_contributions_nationality)
# Subtracting the sum of contributions of all other nationalities to enhance distinctiveness
for nationality in distinctiveness_scores_nationality.index:
    other_nationalities = distinctiveness_scores_nationality.index.difference([nationality])
    distinctiveness_scores_nationality.loc[nationality] -= distinctiveness_scores_nationality.loc[other_nationalities].sum()
top_distinct_topics_nationality = {nationality: distinctiveness_scores_nationality.loc[nationality].nlargest(5) for nationality in distinctiveness_scores_nationality.index}
# Set global parameters for font sizes
plt.rcParams.update({'axes.titlesize': 20,
                     'axes.labelsize': 18,  # X and Y labels font size
                     'xtick.labelsize': 16,
                     'ytick.labelsize': 16,
                     'legend.fontsize': 14})
selected_nationalities = list(top_distinct_topics_nationality.keys())[:9]
# Creating subplots in a 3x3 grid
fig, axes = plt.subplots(3, 3, figsize=(45, 40))
# Flattening the axes array for easier iteration
axes = axes.flatten()
# Plotting the distinctiveness scores for each selected nationality
for i, nationality in enumerate(selected_nationalities):
    topics_data = top_distinct_topics_nationality[nationality].head(5).reset_index()
    topics_data.columns = ['Topic', 'Distinctiveness Score']
    topics_data['Topic Number'] = topics_data['Topic'].str.extract(r'(\d+)')
    # Map the 'Topic Number' to the corresponding labels and include the topic number
    topics_data['Topic Label'] = topics_data['Topic Number'].apply(lambda x: f'Topic {x}: ' + topic_labels.get(f'Topic {x}', f'Topic {x}'))
    # Assign 'Topic Label' to hue and disable legend
    sns.barplot(x='Distinctiveness Score', y='Topic Label', data=topics_data, ax=axes[i], palette="Blues_d", hue='Topic Label', legend=False)
    axes[i].set_title(f'Top 5 Topics for {nationality}')
    axes[i].set_xlabel('Distinctiveness Score')
    axes[i].set_ylabel('')
    legend = axes[i].get_legend()
    if legend:
        legend.remove()
fig.subplots_adjust(wspace=1.5)
plt.show()
# Function to find the texts with the highest contribution to a given topic for each nationality
def find_representative_texts_nationality(topics, num_texts=3):
    representative_texts = {}
    for nationality, topics_scores in topics.items():
        representative_texts[nationality] = {}
        for topic in topics_scores.index:
            top_texts = df.sort_values(by=topic, ascending=False)[['title', 'author', topic]].head(num_texts)
            representative_texts[nationality][topic] = top_texts.to_dict(orient='records')
    return representative_texts
# Finding representative texts for each of the top topics of each nationality
representative_texts_nationality = find_representative_texts_nationality(top_distinct_topics_nationality)
representative_texts_nationality
{'American': {'Topic 44': [{'title': 'The Tell-Tale Heart',
'author': 'Poe, Edgar Allan',
'Topic 44': 94.58},
{'title': 'In Search of the Unknown',
'author': 'Chambers, Robert William',
'Topic 44': 1.64},
{'title': 'The Narrative Of Arthur Gordon Pym Of Nantucket',
'author': 'Poe, Edgar Allan',
'Topic 44': 1.55}],
'Topic 40': [{'title': "Fancy's Show-Box (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 40': 50.5},
{'title': 'The Italian, Or, The Confessional Of The Black Penitents. A Romance',
'author': 'Radcliffe, Ann',
'Topic 40': 5.33},
{'title': "Snow Flakes (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 40': 2.66}],
'Topic 28': [{'title': 'The Oval Portrait',
'author': 'Poe, Edgar Allan',
'Topic 28': 96.28},
{'title': 'The Phantom Rickshaw, and Other Ghost Stories',
'author': 'Kipling, Rudyard',
'Topic 28': 14.09},
{'title': 'In Search of the Unknown',
'author': 'Chambers, Robert William',
'Topic 28': 4.74}],
'Topic 41': [{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 41': 9.01},
{'title': 'A Thin Ghost and Others',
'author': 'James, Montague Rhodes',
'Topic 41': 1.5},
{'title': "The Seven Vagabonds (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 41': 1.23}],
'Topic 20': [{'title': "The Seven Vagabonds (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 20': 7.63},
{'title': "The Three Golden Apples\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 20': 1.41},
{'title': 'The Raven', 'author': 'Poe, Edgar Allan', 'Topic 20': 0.99}]},
'American-English': {'Topic 1': [{'title': 'The Real Right Thing',
'author': 'James, Henry',
'Topic 1': 14.87},
{'title': 'Dracula', 'author': 'Stoker, Bram', 'Topic 1': 1.64},
{'title': 'The Lady of the Lake',
'author': 'Scott, Walter',
'Topic 1': 1.51}],
'Topic 50': [{'title': 'Alonzo The Brave And Fair Imogine',
'author': 'Lewis, Matthew',
'Topic 50': 96.04},
{'title': "The Monkey'S Paw",
'author': 'Jacobs, William Wymark',
'Topic 50': 25.89},
{'title': "The Monkey's Paw\nThe Lady of the Barge and Others, Part 2.",
'author': 'Jacobs, William Wymark',
'Topic 50': 22.41}],
'Topic 22': [{'title': 'Hauntings',
'author': 'Lee, Vernon',
'Topic 22': 46.06},
{'title': 'Hauntings', 'author': 'Lee, Vernon', 'Topic 22': 1.01},
{'title': 'Arthur Mervyn; Or, Memoirs Of The Year 1793',
'author': 'Brown, Charles Brockden',
'Topic 22': 0.85}],
'Topic 14': [{'title': 'Superstition: An Ode',
'author': 'Radcliffe, Ann',
'Topic 14': 55.18},
{'title': 'Christabel',
'author': 'Coleridge, Samuel Taylor',
'Topic 14': 51.03},
{'title': 'The Vampire',
'author': 'Planché, James Robinson',
'Topic 14': 8.33}],
'Topic 51': [{'title': 'The Willows',
'author': 'Blackwood, Algernon',
'Topic 51': 56.27},
{'title': 'A Sicilian Romance',
'author': 'Radcliffe, Ann',
'Topic 51': 26.12},
{'title': "The Abbot's Ghost, or Maurice Treherne's Temptation: A Christmas Story",
'author': 'Barnard, A. M.',
'Topic 51': 25.76}]},
'Canadian': {'Topic 33': [{'title': 'In a Glass Darkly',
'author': 'Le Fanu, Sheridan',
'Topic 33': 17.42},
{'title': 'The Lane That Had No Turning',
'author': 'Parker, Gilbert',
'Topic 33': 1.73},
{'title': 'The House Of The Seven Gables',
'author': 'Hawthorne, Nathaniel',
'Topic 33': 1.42}],
'Topic 58': [{'title': 'The Lady of the Shroud',
'author': 'Stoker, Bram',
'Topic 58': 8.29},
{'title': 'The Lancashire Witches: A Romance of Pendle Forest',
'author': 'Ainsworth, William Harrison',
'Topic 58': 3.76},
{'title': 'The Phantom Rickshaw, and Other Ghost Stories',
'author': 'Kipling, Rudyard',
'Topic 58': 3.41}],
'Topic 63': [{'title': 'The Mystery Of Edwin Drood',
'author': 'Dickens, Charles',
'Topic 63': 5.78},
{'title': 'The Monk. A Romance',
'author': 'Lewis, Matthew',
'Topic 63': 4.94},
{'title': 'The Castle Of Wolfenbach',
'author': 'Parsons, Eliza',
'Topic 63': 4.69}],
'Topic 27': [{'title': "The Sister Years (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 27': 10.36},
{'title': 'The Beetle: A Mystery',
'author': 'Marsh, Richard',
'Topic 27': 2.36},
{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 27': 2.3}],
'Topic 56': [{'title': "Edward Fane's Rosebud (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 56': 55.99},
{'title': "Beneath an Umbrella (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 56': 22.23},
{'title': "The Paradise of Children\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 56': 20.7}]},
'English': {'Topic 22': [{'title': 'Hauntings',
'author': 'Lee, Vernon',
'Topic 22': 46.06},
{'title': 'Hauntings', 'author': 'Lee, Vernon', 'Topic 22': 1.01},
{'title': 'Arthur Mervyn; Or, Memoirs Of The Year 1793',
'author': 'Brown, Charles Brockden',
'Topic 22': 0.85}],
'Topic 50': [{'title': 'Alonzo The Brave And Fair Imogine',
'author': 'Lewis, Matthew',
'Topic 50': 96.04},
{'title': "The Monkey'S Paw",
'author': 'Jacobs, William Wymark',
'Topic 50': 25.89},
{'title': "The Monkey's Paw\nThe Lady of the Barge and Others, Part 2.",
'author': 'Jacobs, William Wymark',
'Topic 50': 22.41}],
'Topic 14': [{'title': 'Superstition: An Ode',
'author': 'Radcliffe, Ann',
'Topic 14': 55.18},
{'title': 'Christabel',
'author': 'Coleridge, Samuel Taylor',
'Topic 14': 51.03},
{'title': 'The Vampire',
'author': 'Planché, James Robinson',
'Topic 14': 8.33}],
'Topic 38': [{'title': 'The Banished Man',
'author': 'Smith, Charlotte',
'Topic 38': 25.16},
{'title': 'The Castle Of Wolfenbach',
'author': 'Parsons, Eliza',
'Topic 38': 17.61},
{'title': 'The Emigrants',
'author': 'Smith, Charlotte',
'Topic 38': 12.88}],
'Topic 37': [{'title': "Old Saint Paul's: A Tale of the Plague and the Fire",
'author': 'Ainsworth, William Harrison',
'Topic 37': 11.91},
{'title': 'The Fortunes Of Perkin Warbeck. A Romance',
'author': 'Shelley, Mary',
'Topic 37': 8.56},
{'title': 'The Black Cat',
'author': 'Poe, Edgar Allan',
'Topic 37': 6.71}]},
'English-Australian': {'Topic 44': [{'title': 'The Tell-Tale Heart',
'author': 'Poe, Edgar Allan',
'Topic 44': 94.58},
{'title': 'In Search of the Unknown',
'author': 'Chambers, Robert William',
'Topic 44': 1.64},
{'title': 'The Narrative Of Arthur Gordon Pym Of Nantucket',
'author': 'Poe, Edgar Allan',
'Topic 44': 1.55}],
'Topic 64': [{'title': "The Paradise of Children\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 64': 25.22},
{'title': 'The Princess and the Goblin',
'author': 'MacDonald, George',
'Topic 64': 22.16},
{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 64': 21.68}],
'Topic 33': [{'title': 'In a Glass Darkly',
'author': 'Le Fanu, Sheridan',
'Topic 33': 17.42},
{'title': 'The Lane That Had No Turning',
'author': 'Parker, Gilbert',
'Topic 33': 1.73},
{'title': 'The House Of The Seven Gables',
'author': 'Hawthorne, Nathaniel',
'Topic 33': 1.42}],
'Topic 54': [{'title': 'Woodstock; or, the Cavalier',
'author': 'Scott, Walter',
'Topic 54': 37.72},
{'title': "The Damned Thing\n1898, From 'In the Midst of Life'",
'author': 'Bierce, Ambrose',
'Topic 54': 13.95},
{'title': 'Varney The Vampire',
'author': 'Rymer, James Malcolm',
'Topic 54': 4.58}],
'Topic 4': [{'title': 'Berenice',
'author': 'Poe, Edgar Allan',
'Topic 4': 84.55},
{'title': 'What Was It? A Mystery',
'author': "O'Brien, Fitz-James",
'Topic 4': 84.03},
{'title': "Edward Randolph'S Portrait",
'author': 'Hawthorne, Nathaniel',
'Topic 4': 24.45}]},
'French-British': {'Topic 24': [{'title': "The Miraculous Pitcher\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 24': 5.78},
{'title': 'Frankenstein; Or, The Modern Prometheus',
'author': 'Shelley, Mary',
'Topic 24': 3.86},
{'title': 'Trilby', 'author': 'du Maurier, George', 'Topic 24': 1.83}],
'Topic 43': [{'title': 'Isabella, Or The Pot Of Basil',
'author': 'Keats, John',
'Topic 43': 6.78},
{'title': 'The Minstrel, Or The Progress Of Genius. A Poem',
'author': 'Beattie, James',
'Topic 43': 6.15},
{'title': 'Hauntings', 'author': 'Lee, Vernon', 'Topic 43': 4.82}],
'Topic 37': [{'title': "Old Saint Paul's: A Tale of the Plague and the Fire",
'author': 'Ainsworth, William Harrison',
'Topic 37': 11.91},
{'title': 'The Fortunes Of Perkin Warbeck. A Romance',
'author': 'Shelley, Mary',
'Topic 37': 8.56},
{'title': 'The Black Cat', 'author': 'Poe, Edgar Allan', 'Topic 37': 6.71}],
'Topic 55': [{'title': 'Hauntings',
'author': 'Lee, Vernon',
'Topic 55': 10.1},
{'title': 'The Adventure Of The German Student',
'author': 'Irving, Washington',
'Topic 55': 9.87},
{'title': 'Northanger Abbey', 'author': 'Austen, Jane', 'Topic 55': 5.66}],
'Topic 4': [{'title': 'Berenice',
'author': 'Poe, Edgar Allan',
'Topic 4': 84.55},
{'title': 'What Was It? A Mystery',
'author': "O'Brien, Fitz-James",
'Topic 4': 84.03},
{'title': "Edward Randolph'S Portrait",
'author': 'Hawthorne, Nathaniel',
'Topic 4': 24.45}]},
'Irish': {'Topic 33': [{'title': 'In a Glass Darkly',
'author': 'Le Fanu, Sheridan',
'Topic 33': 17.42},
{'title': 'The Lane That Had No Turning',
'author': 'Parker, Gilbert',
'Topic 33': 1.73},
{'title': 'The House Of The Seven Gables',
'author': 'Hawthorne, Nathaniel',
'Topic 33': 1.42}],
'Topic 4': [{'title': 'Berenice',
'author': 'Poe, Edgar Allan',
'Topic 4': 84.55},
{'title': 'What Was It? A Mystery',
'author': "O'Brien, Fitz-James",
'Topic 4': 84.03},
{'title': "Edward Randolph'S Portrait",
'author': 'Hawthorne, Nathaniel',
'Topic 4': 24.45}],
'Topic 61': [{'title': 'The Lady of the Shroud',
'author': 'Stoker, Bram',
'Topic 61': 33.12},
{'title': 'Told After Supper',
'author': 'Jerome, Jerome Klapka',
'Topic 61': 20.68},
{'title': 'The Lancashire Witches: A Romance of Pendle Forest',
'author': 'Ainsworth, William Harrison',
'Topic 61': 5.82}],
'Topic 39': [{'title': "The Damned Thing\n1898, From 'In the Midst of Life'",
'author': 'Bierce, Ambrose',
'Topic 39': 22.45},
{'title': 'Salome', 'author': 'Wilde, Oscar', 'Topic 39': 11.32},
{'title': 'The Empty House and Other Ghost Stories',
'author': 'Blackwood, Algernon',
'Topic 39': 9.67}],
'Topic 58': [{'title': 'The Lady of the Shroud',
'author': 'Stoker, Bram',
'Topic 58': 8.29},
{'title': 'The Lancashire Witches: A Romance of Pendle Forest',
'author': 'Ainsworth, William Harrison',
'Topic 58': 3.76},
{'title': 'The Phantom Rickshaw, and Other Ghost Stories',
'author': 'Kipling, Rudyard',
'Topic 58': 3.41}]},
'Scottish': {'Topic 54': [{'title': 'Woodstock; or, the Cavalier',
'author': 'Scott, Walter',
'Topic 54': 37.72},
{'title': "The Damned Thing\n1898, From 'In the Midst of Life'",
'author': 'Bierce, Ambrose',
'Topic 54': 13.95},
{'title': 'Varney The Vampire',
'author': 'Rymer, James Malcolm',
'Topic 54': 4.58}],
'Topic 59': [{'title': 'The Princess and the Goblin',
'author': 'MacDonald, George',
'Topic 59': 7.3},
{'title': 'The House of Souls',
'author': 'Machen, Arthur',
'Topic 59': 5.62},
{'title': 'The Invaders', 'author': 'Ferris, Benjamin', 'Topic 59': 5.41}],
'Topic 72': [{'title': 'A Beleaguered City, Being A Narrative Of Certain Recent Events In The City Of Semur, In The Department Of The Haute Bourgogne. A Story Of The Seen And The Unseen:',
'author': 'Oliphant, Margaret',
'Topic 72': 7.74},
{'title': 'The Death Of Halpin Frayser',
'author': 'Bierce, Ambrose',
'Topic 72': 7.58},
{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 72': 7.42}],
'Topic 64': [{'title': "The Paradise of Children\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 64': 25.22},
{'title': 'The Princess and the Goblin',
'author': 'MacDonald, George',
'Topic 64': 22.16},
{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 64': 21.68}],
'Topic 41': [{'title': "Chippings with a Chisel (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 41': 9.01},
{'title': 'A Thin Ghost and Others',
'author': 'James, Montague Rhodes',
'Topic 41': 1.5},
{'title': "The Seven Vagabonds (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 41': 1.23}]},
'Welsh': {'Topic 53': [{'title': "Old Saint Paul's: A Tale of the Plague and the Fire",
'author': 'Ainsworth, William Harrison',
'Topic 53': 9.87},
{'title': "Sights from a Steeple (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 53': 8.94},
{'title': "Snow Flakes (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 53': 8.57}],
'Topic 12': [{'title': 'La Belle Dame Sans Merci',
'author': 'Keats, John',
'Topic 12': 90.03},
{'title': "Sunday at Home (From 'Twice Told Tales')",
'author': 'Hawthorne, Nathaniel',
'Topic 12': 78.15},
{'title': "The Monkey'S Paw",
'author': 'Jacobs, William Wymark',
'Topic 12': 44.25}],
'Topic 7': [{'title': 'The Black Cat',
'author': 'Poe, Edgar Allan',
'Topic 7': 43.06},
{'title': "The Paradise of Children\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 7': 23.57},
{'title': "The Gorgon's Head\n(From: 'A Wonder-Book for Girls and Boys')",
'author': 'Hawthorne, Nathaniel',
'Topic 7': 8.73}],
'Topic 33': [{'title': 'In a Glass Darkly',
'author': 'Le Fanu, Sheridan',
'Topic 33': 17.42},
{'title': 'The Lane That Had No Turning',
'author': 'Parker, Gilbert',
'Topic 33': 1.73},
{'title': 'The House Of The Seven Gables',
'author': 'Hawthorne, Nathaniel',
'Topic 33': 1.42}],
'Topic 26': [{'title': 'In Search of the Unknown',
'author': 'Chambers, Robert William',
'Topic 26': 3.84},
{'title': 'Tanglewood Tales',
'author': 'Hawthorne, Nathaniel',
'Topic 26': 3.72},
{'title': 'Tales of Men and Ghosts',
'author': 'Wharton, Edith',
'Topic 26': 0.98}]}}
The overwhelming majority of the distinctly American contributions seem to be bound to the strongly masculine topics about poise, but also to the topic about Romanticism uncovered previously. The dominant influences are Poe, Chambers, Brown, and Hawthorne, even though the text most strongly associated with these topics is "The Tell-Tale Heart", which curiously subverts those expectations.
The distinctly British voices carry much more weight than those of any other nationality. Two of their topics come from the list of distinctly female topics: "22 - Emotional Dynamics and Interactions" and "38 - Psychology, Trauma, and Secrets". While 38 has a high density of texts by Mary Shelley and Ann Radcliffe, 22 is very diverse in terms of the authors contributing to it, but carries a strong heterogeneity concerning nationality. As mentioned above, it carries a lot of strongly passionate vocabulary like "breathless, hug, vociferating, moan, ruffled, brazen", with the highest contributions coming from Vernon Lee's "Hauntings" and Godwin's "The Adventures of Caleb Williams".
Connection between sentiment and different topics:
all_topics_sentiment_correlations = df[all_topics + ['sentiment']].corr()['sentiment'].drop('sentiment')
# Selecting the 15 topics with the strongest absolute correlation (considering both positive and negative)
strongest_absolute_correlations = all_topics_sentiment_correlations.abs().nlargest(15)
selected_strongest_absolute_topics = strongest_absolute_correlations.index.tolist()
# Creating a list of labels with topic numbers and labels for the selected topics using the mapping dictionary
selected_labels_with_numbers = [f'{topic}: {topic_labels.get(topic, "Label not found")}' for topic in selected_strongest_absolute_topics]
# Visualization of the correlation for these 15 topics
plt.figure(figsize=(10, 6))
all_topics_sentiment_correlations[selected_strongest_absolute_topics].plot(kind='bar', color='purple')
# Replace the X-axis tick labels with the selected labels
plt.xticks(range(len(selected_labels_with_numbers)), selected_labels_with_numbers, rotation=90)
plt.title('Top 15 Topics with Strongest Absolute Correlation to Sentiment')
plt.xlabel('Topics')
plt.ylabel('Correlation with Sentiment')
plt.grid(axis='y')
# Show the plot
plt.show()
The connection between sentiment and the individual topics is not particularly strong, but where it is present it seems natural and intuitive: texts that lean strongly towards topics marked by carnage, crime, death, and tense judgments lean towards a more negative sentiment, while those focusing on self-expression, ambition, intimacy, and seduction lean towards a positive sentiment.
But given that only three entries have a value higher than 0.1, the connections are not overly strong to begin with.
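How the ranking by absolute correlation works can be shown on synthetic data. The sketch below is only an illustration under stated assumptions: the frame, the column names, and the built-in dependencies are made up, and `DataFrame.corr()` is used with its default Pearson coefficient. Ranking on the absolute value selects the strongest relationships regardless of direction, while reading the values back from the signed series preserves whether a topic leans positive or negative.

```python
import numpy as np
import pandas as pd

# Synthetic data: the dependencies below are constructed on purpose
rng = np.random.default_rng(0)
n = 200
sentiment = rng.normal(size=n)
toy = pd.DataFrame({
    "sentiment": sentiment,
    "Topic 1": -0.5 * sentiment + rng.normal(size=n),  # built to correlate negatively
    "Topic 2": rng.normal(size=n),                     # built to be unrelated
    "Topic 3": 0.8 * sentiment + rng.normal(size=n),   # built to correlate positively
})
toy_topics = ["Topic 1", "Topic 2", "Topic 3"]

# Pearson correlation of every topic column with the sentiment column
corr = toy[toy_topics + ["sentiment"]].corr()["sentiment"].drop("sentiment")

# Rank by magnitude, then look up the signed values
strongest = corr.abs().nlargest(2).index
print(corr[strongest].round(2))
```

A correlation of this kind only captures linear association across texts, so a weak coefficient does not rule out that a topic matters for sentiment within a subset of the corpus.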
Distribution of Topics among Periods, Text Sources and Roles
df_per = df_txt_features_LDA.copy()
df_per['date'] = pd.to_numeric(df_per['date'], errors='coerce')
df_per = df_per.dropna(subset=['date'])
# Extract the decade from the 'date' and create a new column for it
df_per['decade'] = (df_per['date'] // 10 * 10).astype(int)
# Define the relevant topics as specified
relevant_topics = [f"Topic {i}" for i in range(1, 21)] + ["Topic 70", "Topic 65", "Topic 51", "Topic 45", "Topic 38", "Topic 34"]
decade_grouped = df_per.groupby('decade')[topic_columns].mean()
# Identifying topics that have a peak value greater than 8
peaking_topics = [topic for topic in topic_columns if decade_grouped[topic].max() > 8]
agg_topics_by_decade_role = df_per.groupby(['decade', 'role'])[topic_columns].mean().reset_index()
# Filter the relevant topics list to include only those peaking topics
filtered_relevant_topics = [topic for topic in relevant_topics if topic in peaking_topics]
palette = sns.color_palette("husl", n_colors=len(filtered_relevant_topics))
# Creating a facet grid for the filtered topics, overlaying all topics in each linechart
g = sns.FacetGrid(agg_topics_by_decade_role, col="role", col_wrap=3, height=4, sharey=False, palette="viridis")
for i, topic in enumerate(filtered_relevant_topics):
    # Map the topic number to its label
    topic_label = f'{topic}: {topic_labels.get(topic, "Label not found")}'
    g = g.map_dataframe(sns.lineplot, x="decade", y=topic, color=palette[i], label=topic_label)
# Add a legend with the topic labels instead of the numbers
g.add_legend(title="Topics")
# Adjust the legend to display full topic labels if necessary
for text, topic in zip(g._legend.texts, filtered_relevant_topics):
    text.set_text(f'{topic}: {topic_labels.get(topic, "Label not found")}')
g._legend.set_bbox_to_anchor((1.05, 0.5))
plt.setp(g._legend.get_texts(), linespacing=2)
g.set_axis_labels("Decade", "Average Topic Weight")
g.set_titles(col_template="{col_name} Role")
plt.show()
This only strengthens the impression of how central topics 5, 51, 70, and 65 are, even though topic 3 had faded into the background in previous comparisons.
agg_topics_by_decade_period = df_per.groupby(['decade', 'period'])[topic_columns].mean().reset_index()
palette = sns.color_palette("husl", n_colors=len(filtered_relevant_topics))
# Creating a facet grid for each period, overlaying all filtered relevant topics in each linechart
g = sns.FacetGrid(agg_topics_by_decade_period, col="period", col_wrap=2, height=4, sharey=False, palette=palette)
# Map each topic to a line in the grid and include both the topic number and label in the legend
for i, topic in enumerate(filtered_relevant_topics):
    topic_label = f'{topic}: {topic_labels.get(topic, "Label not found")}'
    g = g.map_dataframe(sns.lineplot, x="decade", y=topic, color=palette[i], label=topic_label)
g.add_legend(title="Topics")
g._legend.set_bbox_to_anchor((1.05, 0.5))
plt.setp(g._legend.get_texts(), linespacing=2)
g.set_axis_labels("Decade", "Average Topic Weight")
g.set_titles(col_template="{col_name} Period")
plt.show()
# Aggregating the topic distributions by decade and source
agg_topics_by_decade_source = df_per.groupby(['decade', 'source'])[topic_columns].mean().reset_index()
# Creating a facet grid for each source, overlaying all filtered relevant topics in each linechart
g = sns.FacetGrid(agg_topics_by_decade_source, col="source", col_wrap=2, height=4, sharey=False, palette=palette)
# Map each topic to a line in the grid and include both the topic number and label in the legend
for i, topic in enumerate(filtered_relevant_topics):
    topic_label = f'{topic}: {topic_labels.get(topic, "Label not found")}'
    g = g.map_dataframe(sns.lineplot, x="decade", y=topic, color=palette[i], label=topic_label)
# Adjusting plot labels and adding a legend for different topics
g.add_legend(title="Topics")
g.set_axis_labels("Decade", "Average Topic Weight")
g.set_titles(col_template="{col_name} Source")
plt.show()
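The decade aggregation and the peak filter used for the charts above can be illustrated on a small example. The following is a minimal sketch on invented toy data: the column names mirror the notebook, while `toy`, `topic_cols` and `peak_threshold` are hypothetical names chosen for illustration only.

```python
import pandas as pd

# Toy stand-in for df_per: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "date": ["1794", "1796", "1818", "unknown", "1897"],
    "Topic 1": [12.0, 6.0, 3.0, 9.0, 1.0],
    "Topic 2": [2.0, 4.0, 5.0, 1.0, 7.0],
})

# Coerce unparseable dates to NaN and drop those rows before binning.
toy["date"] = pd.to_numeric(toy["date"], errors="coerce")
toy = toy.dropna(subset=["date"])

# Floor each year to the start of its decade, e.g. 1794 -> 1790.
toy["decade"] = (toy["date"] // 10 * 10).astype(int)

topic_cols = ["Topic 1", "Topic 2"]
decade_means = toy.groupby("decade")[topic_cols].mean()

# Keep only topics whose decade average exceeds the threshold at least once.
peak_threshold = 8
peaking = [t for t in topic_cols if decade_means[t].max() > peak_threshold]

print(decade_means)
print(peaking)
```

Because the filter looks at decade averages, a single text with a very high topic weight only counts as a peak if its decade contains few other texts to dilute it.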
Cluster Analysis¶
The relationship between the texts is examined for its underlying composition, using their topic distributions as features. For this, principal component analysis (PCA) is applied to the topic columns, and K-Means clustering is applied to the reduced data to group the texts into categories.
df_clu = df_txt_features_LDA.copy()
topic_columns = [col for col in df_clu.columns if col.startswith('Topic')]
# Selecting only the topic distribution columns for clustering
topic_data = df_clu[topic_columns]
# Using PCA for dimensionality reduction
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(topic_data)
# Applying K-means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)
df_clu['cluster'] = labels
# Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=labels, cmap='viridis', marker='o')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=300, alpha=0.6)
plt.title('PCA-reduced Topic Data with K-means Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
# Plotting the topic distributions for each cluster
fig, axs = plt.subplots(nrows=5, ncols=1, figsize=(15, 20))
for i in range(5):
    cluster_data = df_clu[df_clu['cluster'] == i][topic_columns].mean()
    axs[i].bar(x=cluster_data.index, height=cluster_data.values)
    axs[i].set_title(f'Cluster {i+1} Topic Distribution')
    axs[i].set_ylabel('Average Topic Weight')
    axs[i].tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
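The `KMeans` call above sets no `random_state`, so the numbering of the clusters, and possibly their membership, can differ between runs. The following is a minimal sketch, on synthetic Dirichlet samples instead of the thesis corpus, of how the same PCA and K-Means pipeline might be made repeatable and how much variance the two components retain. The data, the seed values and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic document-topic matrix: 200 documents over 10 topics, each row sums to 1.
doc_topics = rng.dirichlet(alpha=np.full(10, 0.3), size=200)

pca = PCA(n_components=2)
reduced = pca.fit_transform(doc_topics)
# Share of the total variance that survives the projection onto two components.
retained = pca.explained_variance_ratio_.sum()

# Fixing random_state (and n_init) makes the cluster assignment repeatable.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(reduced)

print(f"variance retained by 2 components: {retained:.2f}")
print(np.bincount(cluster_ids, minlength=5))
```

If the retained variance is low, clusters found in the two-dimensional projection may not reflect groupings in the full topic space, which is worth keeping in mind when reading the cluster plots.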
A closer examination of the outliers
# Counting the number of texts in each cluster
cluster_counts = df_clu['cluster'].value_counts()
# Creating a dictionary with clusters as keys and references as values
cluster_references_dict = df_clu.groupby('cluster')['reference'].apply(list).to_dict()
# Filtering the dictionary to include only the three smallest clusters
# Sorting clusters by their size to identify the three smallest
smallest_clusters = cluster_counts.nsmallest(3).index
smallest_cluster_references = {cluster+1: cluster_references_dict[cluster] for cluster in smallest_clusters}
# Printing the references for the three smallest clusters
smallest_cluster_references
{4: ['Hawthorne_SundayatHo_1',
'Jacobs_TheMonkeyS_1',
'Jacobs_TheMonkeys_1',
'Keats_LaBelleDam_1'],
3: ['Aikin_SirBertran_1',
'Hawthorne_LittleAnni_1',
'Hawthorne_TheLilysQu_1',
'Hawthorne_TheMiniste_1',
'Hawthorne_TheWhiteOl_1'],
5: ['Bangs_GhostsIHav_1',
'Bierce_AnOccurren_1',
'Holcroft_AnnaStIves_1',
'Lewis_AlonzoTheB_1',
'Machen_TheHouseof_1',
'Stagg_TheVampyre_1',
'Stoker_TheLairOfT_1',
'unsigned_CountRoder_1']}
The clustering yields two large groups with an even distribution of topics and a smaller third group with a much narrower distribution that focuses on a few specific topics. This narrower focus is concentrated on topics that carry exceptional weight in the corpus as a whole: 70 "Myth and splendor - Wealth and Castles" and 12 "Home Invasion - Domestic Mystery and Conflict". Topic 12 is particularly concentrated in a select few influential authors that stand apart.
The importance of 70, on the other hand, might reflect its weight in some of the major voices within the corpus, such as Hawthorne and Marie Corelli.
Further important influences to investigate are topics 49 - "Departure and Music", 50 - "Myth, Nature, Wonder and Despair" and, to a lesser degree, 51 - "Disillusionment with Society - Resistance, Protest, Retreat".
The sparse clusters with only a few entries warrant a closer inspection. They load on 35 - "Mental Illness, Law and Outcasts - Fear, Suspicion and Struggles", 36 - "Individualism vs. Conformity - Rebellion and Social Norms" and 52 - "Adventure, Splendor, Power and Challenges, History", and place a lot of weight on 70 - "Myth and splendor - Wealth and Castles".
A subsection of the texts seems to deal heavily with topics focused on societal retreat, solitude, personal autonomy, and rebellion for the sake of one's convictions. Within this subsection, however, the texts split into two groupings. One is about adventure, exploration, marveling at discoveries, and forgotten splendor. The other is equally disillusioned with society, standing in opposition to it or actively departing from it, but does not enjoy what it finds and is haunted by foreign forces that bring conflict and grief.
Jacobs's "The Monkey's Paw" and Keats's "La Belle Dame sans Merci" both tell of a tempting encounter with an alluring other, a magical artifact in one case and a fairy in the other, and detail the anguish that the protagonists' desires bring them.
Similarly, Machen's The House of Souls is a collection of short texts, the most prominent of them "The Inmost Light", "The Great God Pan" and "The White People", which deal with humans who cross the veil of what was meant for their kind to perceive and experience, and with the disturbing or corrupting experiences that ensue.
Meanwhile, Hawthorne's "The Minister's Black Veil" tells the story of a man of faith turning away from his community and his old life, only to rise in esteem, influence, and power through his renouncement of personal connection. "Sunday at Home" is an ambiguous text about worship and community, with a mixture of longing and contempt for a church congregation.
Hierarchical Clustering¶
Hierarchical clustering of topics based on their distribution similarity.
# Compute the Jensen-Shannon divergence, the metric also in use in pyLDAvis
def jensen_shannon_divergence(p, q):
    """
    Compute the Jensen-Shannon divergence between two probability distributions.
    """
    p = np.asarray(p)
    q = np.asarray(q)
    m = (p + q) / 2
    return (entropy(p, m) + entropy(q, m)) / 2
dist_matrix = pdist(topic_term_dists_LDA, metric=jensen_shannon_divergence)
linkage_matrix = linkage(dist_matrix, method='average')
plt.figure(figsize=(15, 10))
dendrogram(linkage_matrix, labels=range(1, len(topic_term_dists_LDA) + 1))
plt.title('Hierarchical Clustering of Topics')
plt.xlabel('Topic')
plt.ylabel('Distance')
plt.show()
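A few properties of the Jensen-Shannon divergence help when reading the distance axis of the dendrogram. The sketch below repeats the function from the cell above on invented three-term distributions (`p`, `q` and the other variable names are illustrative only). It checks that the divergence is symmetric, zero for identical distributions, bounded by ln(2) when natural logarithms are used, and equal to the square of SciPy's `jensenshannon` distance.

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with natural logarithms, as defined in the notebook."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = (p + q) / 2
    return (entropy(p, m) + entropy(q, m)) / 2

# Two invented term distributions over a three-word vocabulary.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

jsd_pq = jensen_shannon_divergence(p, q)
jsd_qp = jensen_shannon_divergence(q, p)      # same value: the measure is symmetric
jsd_same = jensen_shannon_divergence(p, p)    # identical distributions give 0
# Distributions without any shared support reach the upper bound ln(2).
jsd_disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
# SciPy's jensenshannon returns the square root of the divergence.
js_distance = jensenshannon(p, q)

print(jsd_pq, jsd_qp, jsd_same)
print(jsd_disjoint, np.log(2))
print(js_distance ** 2)
```

Under this definition, no pair of topics in the dendrogram can be further apart than ln(2), roughly 0.693, so merge heights close to that value indicate topics with almost no shared vocabulary.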
Here the grouping seems to create two outgroups. The first is composed of 31 - "Exploration, Gloom, Caverns", a very niche topic with little weight on the larger whole; 29 - "Bickering, Fighting and Mountains", a highly concentrated topic with impact on the wider grouping; and 60 - "Confession and marriage before Conscription and Battle", also a very niche topic with little weight.
The other outgroup is composed of 65 - "Atmospheric Battle Descriptions and Royalty", 51 - "Disillusionment with Society - Resistance, Protest, Retreat", 5 - "Excitability, Madness and Deceit - Aggression, conflict and Glee" and 38 - "Psychology, Trauma and Secrets". This second outgroup forms a very powerful group of topics: a small selection that carries a large weight in the corpus.
Correlation Heatmap¶
The following correlation matrix covers the attributes 'gender', 'nationality', 'source', 'sentiment', 'period', 'mode', 'genre', 'role' and 'cluster', together with a binary indicator for each topic that marks whether a text's value lies above that topic's median.
# Calculating the median for each topic and creating binary variables for high/low values
topic_medians = df_clu[topic_columns].median()
for topic in topic_columns:
    df_clu[f'{topic}_high'] = df_clu[topic] > topic_medians[topic]
columns_of_interest = ['gender','nationality', 'source', 'sentiment',
'period', 'mode', 'genre', 'role', 'cluster'] + [f'{topic}_high' for topic in topic_columns]
analysis_df = df_clu[columns_of_interest]
# Converting categorical variables to dummy variables for correlation analysis
analysis_df_dummies = pd.get_dummies(analysis_df, columns=['gender', 'nationality', 'source', 'period', 'mode', 'genre', 'role'])
correlation_matrix = analysis_df_dummies.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(correlation_matrix, cmap='coolwarm', square=True)
plt.title('Correlation Matrix for Demographics, Polarity, Clusters and High Topic Values')
plt.show()
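The preparation behind the heatmap (a median split per topic, one-hot encoding of the categorical attributes, then a Pearson correlation) can be followed on a small example. This is a minimal sketch on invented values. The column names follow the notebook, but `toy`, `encoded` and the explicit `dtype=float` conversion are assumptions made for illustration.

```python
import pandas as pd

# Invented toy data: one categorical attribute, one numeric attribute, one topic.
toy = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f", "m"],
    "sentiment": [0.2, -0.1, 0.4, -0.3, 0.1, 0.0],
    "Topic 1": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
})

# Flag documents whose topic weight lies above that topic's median.
toy["Topic 1_high"] = toy["Topic 1"] > toy["Topic 1"].median()

# One-hot encode the categorical column; floats keep corr() purely numeric.
encoded = pd.get_dummies(toy.drop(columns=["Topic 1"]), columns=["gender"], dtype=float)
encoded["Topic 1_high"] = encoded["Topic 1_high"].astype(float)

corr = encoded.corr()
print(corr.round(2))
```

In this toy case the high-topic flag coincides exactly with one gender, so the correlation is 1 for `gender_f` and -1 for `gender_m`. The dummy columns of a two-valued attribute always mirror each other in this way, which explains symmetric pairs of rows in a heatmap of this kind.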
Network Analysis:¶
The following network analysis deals with influence among texts, with the intention of establishing influence and similarity among authors. For this, the pairwise cosine similarity between documents is computed from their topic distributions.
Nodes represent the documents. Edges represent the similarity between documents, with a threshold applied to filter out low-similarity connections.
The resulting network is then analyzed for clusters of similar texts, centrality measures, and other network characteristics.
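The construction described above can be sketched on a handful of invented topic profiles. The names `A` to `D`, the profile values and the use of `np.triu_indices` to visit each pair once are assumptions for illustration; the threshold of 0.85 is the value used in the cells below.

```python
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Four invented topic profiles over three topics.
names = ["A", "B", "C", "D"]
profiles = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
])

sim = cosine_similarity(profiles)
threshold = 0.85

G = nx.Graph()
G.add_nodes_from(names)
# Visit each unordered pair once via the upper triangle; k=1 skips the diagonal.
rows, cols = np.triu_indices(len(names), k=1)
for i, j in zip(rows, cols):
    if sim[i, j] >= threshold:
        G.add_edge(names[i], names[j], weight=float(sim[i, j]))

print(sorted(G.edges()))
print([sorted(component) for component in nx.connected_components(G)])
```

Only the two pairs with nearly identical profiles pass the threshold, so the graph falls into two components. Lowering the threshold adds edges and eventually merges the components, which is why the choice of threshold shapes the clusters that become visible.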
df_net = df_txt_features_LDA.copy()
topic_columns = [col for col in df_net.columns if col.startswith('Topic ')]
random.seed(3)
np.random.seed(3)
Network of overall similarity of texts¶
# Group by 'author' and calculate the mean for each topic column
author_topics = df_net.groupby('author')[topic_columns].mean().reset_index()
# Recalculate the cosine similarities based on the averaged topic distributions
similarity_matrix = cosine_similarity(author_topics[topic_columns])
# Since the similarity matrix is symmetric, the diagonal is filled up with np.nan to avoid self-loops
np.fill_diagonal(similarity_matrix, np.nan)
similarity_threshold = 0.85
G = nx.Graph()
# Add nodes to the graph, using authors as the node identifier
for idx, row in author_topics.iterrows():
    G.add_node(row['author'], author=row['author'])
# Add edges based on the similarity threshold and the averaged topic distribution
for i in range(len(similarity_matrix)):
    for j in range(i+1, len(similarity_matrix)):
        if similarity_matrix[i][j] >= similarity_threshold:
            author_i = author_topics.iloc[i]['author']
            author_j = author_topics.iloc[j]['author']
            G.add_edge(author_i, author_j, weight=similarity_matrix[i][j])
node_sizes = [10 * G.degree(n) for n in G.nodes()]
edges = G.edges()
weights = [G[u][v]['weight'] for u,v in edges]
labels = {author: author for author in author_topics['author']}
# Label only the most central nodes to reduce label overlap
degree_dict = dict(G.degree(G.nodes()))
central_nodes = [node for node in degree_dict if degree_dict[node] >= np.percentile(list(degree_dict.values()), 50)] # Adjust threshold as needed
central_labels = {node: labels[node] for node in central_nodes}
# Use the Spring layout for a more spread out layout
pos = nx.spring_layout(G, k=0.20, iterations=20, seed=4)
plt.figure(figsize=(15, 15))
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, alpha=0.7)
nx.draw_networkx_edges(G, pos, edgelist=edges, width=weights, alpha=0.2)
nx.draw_networkx_labels(G, pos, labels=central_labels, font_size=6)
plt.title("Network of Authors' Influence Based on Averaged Topic Distributions")
plt.axis('off')
plt.show()
The network shows a few clear centers of similarity and influence:
The biggest collection of influential nodes is a grouping composed of Mary Shelley, William Godwin, Frances Burney, and Charles Brockden Brown, with several smaller authors surrounding them. Their influence and central position are reflected in their topic distribution as well: Brown, Godwin and Shelley represent all the most central and influential topics within the corpus, making them pivotal in shaping the Gothic fiction genre. Their central location and number of connections imply that they could have had a considerable influence on their contemporaries and possibly on those who followed.
Another center is composed of Percy Shelley, Horace Walpole, Eleanor Sleath, and Thomas Leland, making for a very early group of authors, all active before 1800, that holds a pioneering position.
A smaller and less densely connected grouping covers Sheridan Le Fanu, the Brontë sisters, Elizabeth Gaskell and Marie Corelli, whose texts share topics with an emphasis on psychological exploration and internal struggles, potentially influenced by Romanticism.
Additional points of interest are the bridges between these groups: groups 1 and 2 are connected through Regina Maria Roche, groups 2 and 3 through Eaton Stannard Barrett, and groups 1 and 3 through James Hogg and William Beckford.
Also noteworthy is the fact that Hawthorne, who was arguably overrepresented in many other graphs, is absent here, whether through uniqueness of style, idiosyncrasy, or as a result of aggregating such a broad range of topics. The same goes for Stoker and Ann Radcliffe.
Averaging the Distribution on the features of all text segments¶
# Group by 'text_key' and calculate the mean for each topic column
text_key_topics = df_net.groupby('text_key')[topic_columns].mean().reset_index()
# Recalculate the cosine similarities based on the averaged topic distributions
similarity_matrix = cosine_similarity(text_key_topics[topic_columns])
# Since the similarity matrix is symmetric, diagonal is filled with np.nan to avoid self-loops
np.fill_diagonal(similarity_matrix, np.nan)
similarity_threshold = 0.85
G = nx.Graph()
# Add nodes to the graph, using text_keys as the node identifier
for idx, row in text_key_topics.iterrows():
    # Use the text_key itself as node identifier so that nodes and edges refer to the same objects
    G.add_node(row['text_key'], text_key=row['text_key'])
# Add edges based on the similarity threshold and the averaged topic distribution
for i in range(len(similarity_matrix)):
    for j in range(i+1, len(similarity_matrix)):
        if similarity_matrix[i][j] >= similarity_threshold:
            text_key_i = text_key_topics.iloc[i]['text_key']
            text_key_j = text_key_topics.iloc[j]['text_key']
            G.add_edge(text_key_i, text_key_j, weight=similarity_matrix[i][j])
node_sizes = [10 * G.degree(n) for n in G.nodes()]
edges = G.edges()
weights = [G[u][v]['weight'] for u, v in edges]
# Labels - using the 'text_key' as labels
labels = {row['text_key']: row['text_key'] for idx, row in df_net.iterrows()}
# Creating labels only for the most central nodes
degree_dict = dict(G.degree(G.nodes()))
central_nodes = [node for node in degree_dict if degree_dict[node] >= np.median(list(degree_dict.values()))]
# When creating central_labels, ensure that the node exists in labels
central_labels = {node: labels[node] for node in central_nodes if node in labels}
# Use the Spring layout for a more spread out layout
pos = nx.spring_layout(G, k=0.15, iterations=20, seed=5)
plt.figure(figsize=(15, 15))
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, alpha=0.7)
nx.draw_networkx_edges(G, pos, edgelist=edges, width=weights, alpha=0.2)
nx.draw_networkx_labels(G, pos, labels=central_labels, font_size=6)
plt.title("Network of Texts Based on Averaged Topic Distributions")
plt.axis('off')
plt.show()
Averaging per text takes the length of texts and the number of contributions per author out of the picture, but potentially also lessens the weight that an individual, unique contribution might carry. The averaged distribution shows a slightly different picture.
This network has many similarities with the previous one:
It moves Godwin's Caleb Williams into a centerpiece position connecting the first and the second group, while the works of Mary Shelley drift into the centers of all the major groupings. Pieces from Le Fanu, Gaskell, Shelley, and Lewis intermix, with Roche's The Children of the Abbey carrying particularly much weight and with Brown's Edgar Huntly and Arthur Mervyn and De Quincey's Klosterheim in the mix.
It firmly groups Walpole, Percy Shelley, Eleanor Sleath, and Thomas Leland in a shared circle of influence, and it shifts Mary Shelley's Frankenstein into this cluster as well, with Eaton Barrett's The Heroine as a new outer centerpiece carrying a lot of traction. Here, finally, the figure of Ann Radcliffe emerges, positioned centrally among the early pioneers of the genre.
The third, smaller hub has largely fractured and been reabsorbed, leaving Frances Burney's Camilla as a centerpiece with some others around it, like Richard Burton's Vikram, Mary Shelley's Lodore, and Brown's Wieland, but fewer texts circulate in its orbit.
At the outskirts of this orbit, Machen and Blackwood have united again, as opposed to the other display, with Godwin's St. Leon and Carver's The Horrors of Oakendale Abbey connecting them with the circles of the third branch. Once again, Hogg connects groups 2 and 3, but Beckford is absent and Lytton and Polidori have shifted into his place. This highlights the crystallization of a cluster of texts surrounding the intrusion of outsiders and forbidden knowledge.
Hawthorne's works appear on the fringes of the network, disconnected from most other pieces and mainly self-referential. Stoker and Byron are largely absent as well, influencing each other most of all.
This grouping puts more emphasis on Mary Shelley and William Godwin as a thematic bridge between different groupings of Gothic fiction authors, highlighting their thematic sway over the genre formation as a whole.
Tracing Influence across the network¶
For this, all sections of a text are treated equally, only a text's most prominent topics are compared with the rest of the network, and similarity is evaluated in one direction only, from the older to the newer texts.
unique_text_keys = df_net['text_key'].unique()
topic_columns = [col for col in df_net.columns if col.startswith('Topic')]
top_topics_list = []
for index, row in df_net.iterrows():
    sorted_topics = row[topic_columns].sort_values(ascending=False).head(10)
    top_topics_dict = sorted_topics.to_dict()
    top_topics_list.append(top_topics_dict)
top_topics_df = pd.DataFrame(top_topics_list)
top_topics_df = top_topics_df.fillna(0)
# Function to calculate cosine similarity
def calculate_similarity(df):
    matrix = df.to_numpy()
    sim_matrix = cosine_similarity(matrix)
    return sim_matrix
# Nullify all columns except the top ten topics for similarity calculation
similarity_matrix = calculate_similarity(top_topics_df)
G = nx.DiGraph()
# Add nodes with text_key as label and date as attribute
for text_key in unique_text_keys:
    # Extract the date for this text_key
    date = df_net[df_net['text_key'] == text_key]['date'].iloc[0]
    G.add_node(text_key, date=date)
#Adding Edges
similarity_threshold = 0.75
# Creating a dictionary for quick access to text_key indices
text_key_to_index = {text_key: i for i, text_key in enumerate(unique_text_keys)}
# Iterate over each pair of text segments
for i, text_key1 in enumerate(unique_text_keys):
    for j, text_key2 in enumerate(unique_text_keys):
        if i != j:
            # Check if similarity is above the threshold
            if similarity_matrix[i, j] >= similarity_threshold:
                # Determine the direction of the edge based on the date
                date1 = df_net[df_net['text_key'] == text_key1]['date'].iloc[0]
                date2 = df_net[df_net['text_key'] == text_key2]['date'].iloc[0]
                if date1 < date2:
                    # Add edge from older text to newer text
                    G.add_edge(text_key1, text_key2, weight=similarity_matrix[i, j])
                elif date1 == date2:
                    # Add bilateral edges for texts from the same year
                    G.add_edge(text_key1, text_key2, weight=similarity_matrix[i, j])
                    G.add_edge(text_key2, text_key1, weight=similarity_matrix[j, i])
degree_centrality = nx.in_degree_centrality(G)
# Sort nodes by degree centrality (highest centrality first)
sorted_nodes = sorted(G.nodes(), key=lambda node: degree_centrality[node], reverse=True)
# Create a mapping of numbers to sorted node references
node_labels = {node: i for i, node in enumerate(sorted_nodes)}
# Create a reverse mapping for the legend
label_to_node = {i: node for node, i in node_labels.items()}
node_sizes = [G.degree(node) * 100 for node in sorted_nodes]
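One detail of the cell above deserves attention: `top_topics_df` holds one row per row of `df_net`, while the edge loop indexes `similarity_matrix` by position in `unique_text_keys`. The two only line up if `df_net` contains exactly one row per `text_key`. The following is a minimal sketch, on invented toy segments, of one way to keep rows and keys aligned by averaging the segments of each text first. The helper `keep_top_k`, the value `k=2` and the toy data are assumptions for illustration, not the procedure used for the results reported here.

```python
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Toy segments: two rows per text, invented values, column names as in the notebook.
segments = pd.DataFrame({
    "text_key": ["Early", "Early", "Middle", "Middle", "Late", "Late"],
    "date": [1790, 1790, 1820, 1820, 1890, 1890],
    "Topic 1": [0.6, 0.5, 0.5, 0.6, 0.0, 0.1],
    "Topic 2": [0.3, 0.4, 0.4, 0.3, 0.1, 0.1],
    "Topic 3": [0.1, 0.1, 0.1, 0.1, 0.9, 0.8],
})
topic_cols = ["Topic 1", "Topic 2", "Topic 3"]

# One row per text, in order of first appearance, so matrix position i is text i.
per_text = segments.groupby("text_key", sort=False).agg(
    {"date": "first", **{c: "mean" for c in topic_cols}}
)

def keep_top_k(weights: np.ndarray, k: int) -> np.ndarray:
    """Zero every entry of each row except its k largest values."""
    masked = np.zeros_like(weights)
    top_idx = np.argsort(weights, axis=1)[:, -k:]
    top_vals = np.take_along_axis(weights, top_idx, axis=1)
    np.put_along_axis(masked, top_idx, top_vals, axis=1)
    return masked

sim = cosine_similarity(keep_top_k(per_text[topic_cols].to_numpy(), k=2))

G = nx.DiGraph()
keys = list(per_text.index)
dates = per_text["date"].to_numpy()
G.add_nodes_from(keys)
for i, source in enumerate(keys):
    for j, target in enumerate(keys):
        # Edges only run forward in time; same-year pairs receive both directions.
        if i != j and sim[i, j] >= 0.75 and dates[i] <= dates[j]:
            G.add_edge(source, target, weight=float(sim[i, j]))

print(list(G.edges()))
```

In the toy data only the two texts with matching dominant topics are linked, and the edge points from the 1790 text to the 1820 text.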
Interpretation Variant¶
The following version relies solely on features inherent to the texts themselves, while the evaluation will be based on additional details.
centrality = nx.degree_centrality(G)
# Calculate the cumulative weight for edges where connections exist in both directions
processed_pairs = set()
for u, v in list(G.edges()):
    # Handle each reciprocal pair only once; a second visit would add the already summed weight again
    if G.has_edge(v, u) and (v, u) not in processed_pairs:
        total_weight = G[u][v]['weight'] + G[v][u]['weight']
        G[u][v]['weight'] = total_weight
        G[v][u]['weight'] = total_weight
        processed_pairs.add((u, v))
# Storing these measures as node attributes for later use
for node, value in degree_centrality.items():
    G.nodes[node]['degree_centrality'] = value
partition = community_louvain.best_partition(G.to_undirected())
for node, comm_id in partition.items():
    G.nodes[node]['community'] = comm_id
# Use colors for different communities
community_colors = [partition[node] for node in G.nodes()]
node_sizes = [v * 1000 for v in degree_centrality.values()]
centrality = nx.degree_centrality(G)
# Sort nodes by centrality (more central nodes get lower numbers)
sorted_nodes = sorted(G.nodes, key=lambda node: centrality[node], reverse=True)
# Assign numbers to nodes based on sorted order
numbered_labels = {node: i+1 for i, node in enumerate(sorted_nodes)}
# Use the spring layout for visualization
pos = nx.spring_layout(G, k=0.25, iterations=20, seed=42)
plt.figure(figsize=(20, 20))
nx.draw_networkx_edges(G, pos, alpha=0.2)
nx.draw_networkx_nodes(G, pos, node_size=[10 * G.degree(n) for n in G.nodes()], alpha=0.7)
# No need to draw labels here as we're adjusting their placement
# Create a two-column legend
sorted_legend_items = sorted(numbered_labels.items(), key=lambda item: item[1])
items_per_column = len(sorted_legend_items) // 2
# Initialize empty strings for each column of the legend
left_column_text = ""
right_column_text = ""
# Populate the column strings
for index, (node, number) in enumerate(sorted_legend_items):
    entry = f"{number}: {node}\n"
    if index < items_per_column:
        left_column_text += entry
    else:
        right_column_text += entry
plt.subplots_adjust(left=0.2, right=0.8)
# Place the column strings on the plot
plt.figtext(0.02, 0.5, left_column_text, ha="left", fontsize=8, bbox={"facecolor":"orange", "alpha":0.5, "pad":5}, va='center')
plt.figtext(0.98, 0.5, right_column_text, ha="right", fontsize=8, bbox={"facecolor":"orange", "alpha":0.5, "pad":5}, va='center')
# Adjust label positions to avoid overlap with nodes
labels_pos = {node: (pos[node][0], pos[node][1] + 0.04) for node in G.nodes()} # Shift labels slightly above nodes
# Draw labels and use adjust_text to improve their placement
texts = []
for node, label_pos in labels_pos.items():
    text = plt.text(label_pos[0], label_pos[1], str(numbered_labels[node]), ha='center', va='center', fontsize=8)
    texts.append(text)
adjust_text(texts, autoalign='y', avoid_points=False, precision=0.01)
plt.title('Cosine Similarity with their predecessors - Influence on Gothic Fiction Texts', fontsize=24)
plt.axis('off')
plt.show()
The label adjustment has been set to move labels only vertically for the sake of visibility. In case of uncertainty about which node a label refers to: it is the closest node below the label, after a certain offset, with minor leeway for the sake of legibility.
This grouping, which takes chronology into consideration, brings a few previously unheard of texts to the forefront, but the larger picture strengthens the impression informed by previous network analysis. The fact that the labeling has been done in order of degree centrality allows for a more precise ranking than any of the previous attempts and takes the element of personal expectation, preferential investigation and subjective gaze out of the picture.
The shape of the network has changed drastically in comparison to the previous iterations. It now consists of one larger, loosely integrated cluster made up of a number of subgroupings, as well as a scant few highly influential outliers which seemingly arose without reference to prior works of the genre but nevertheless provided a rich source of inspiration for subsequent texts.
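Because edges in this graph run from older to newer texts, the direction of a node's connections carries meaning: incoming edges mark similarity to predecessors, outgoing edges mark similarity to successors. The notebook computes both `nx.in_degree_centrality` and `nx.degree_centrality`, and the two can rank texts differently. The toy graph below, with invented node names, is a minimal sketch of that difference.

```python
import networkx as nx

# Toy chronological graph: every edge points from the older text to the newer one.
G = nx.DiGraph()
G.add_edges_from([
    ("Text 1790", "Text 1800"),
    ("Text 1790", "Text 1810"),
    ("Text 1790", "Text 1820"),
    ("Text 1800", "Text 1820"),
])

in_c = nx.in_degree_centrality(G)    # similarity to predecessors
out_c = nx.out_degree_centrality(G)  # similarity to successors
total_c = nx.degree_centrality(G)    # both directions combined

for node in sorted(G.nodes(), key=total_c.get, reverse=True):
    print(f"{node}: in={in_c[node]:.2f} out={out_c[node]:.2f} total={total_c[node]:.2f}")
```

The earliest text has no incoming edges but the highest combined centrality, which is the pattern described above for outliers that appear without reference to prior works yet resemble many later ones. Ranking by in-degree alone would instead place the latest, most derivative text first.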
Upper outliers:
22: Edward Bulwer-Lytton's Falkland: An amoral piece of self-indulgence centered around the courting of a young woman by a protagonist tormented by premonitions of her death, inspired by Goethe's Sturm und Drang piece The Sorrows of Young Werther.
69: Mary Shelley's The Fortunes of Perkin Warbeck: A piece of political intrigue set in the Wars of the Roses.
24: Matthew Lewis's The Castle Spectre: The text covers religious fanaticism and psychological mystery.
44: Walter Scott's The Black Dwarf: The Black Dwarf details the story of a misanthropic hermit dwarf in league with the devil amid political intrigue.
39: Edgar Allan Poe's The Fall of the House of Usher: The Fall of the House of Usher details the story of an isolated aristocratic family, madness, as well as feelings of fear, doom, and guilt.
40: Charles Dickens's The Haunted Man and the Ghost's Bargain: The text deals with a professor haunted by past mistakes and sorrows and plagued by apparitions.
41: Elizabeth Gaskell's The Doom of the Griffiths: It covers family curses, myth, revenge, murder, and the role of women in society.
Upper Half:
38: Mary Shelley's The Last Man: It deals with isolation, global plague, disaster, medicine, science, and the loss of political ideas.
29: John Palmer's The Haunted Cavern. A Caledonian Tale: The Haunted Cavern covers specters, murder, gloomy caverns and a betrothal to a villain.
32: Eaton Barrett's The Heroine: A popular bestseller that combined the style of Radcliffe's gothic romances with an over-the-top comedic and quixotic adventure and exploration arc of a female protagonist, with a paternal call for domestic submissiveness at the end. This text had already appeared highly influential in the visualizations above, and numerous quotes on its reach and influence exist.
77: Charlotte Smith's The Emigrants: The Emigrants deals with the trauma and alienation that French emigrants carry with them after the revolution.
Upper Half-Lower:
42 + 36 + 53: Nathaniel Hawthorne's The Lily's Quest, The White Old Maid and Little Annie's Ramble: All three Hawthorne texts stem from the same collection of short stories. The Lily's Quest deals with the cycle of life and death, the futility of the pursuit of happiness and the necessity of hardship. The White Old Maid deals with ominous ritual surrounding death, a love triangle and reunification. Little Annie's Ramble covers childhood innocence, curiosity, the longing of an adult for simple joys past, and a benevolent, unconsenting child abduction.
58: William Ainsworth's Rookwood: A gothic romance detailing royal inheritance disputes after an ill-fated death.
Lower outliers:
19: Wilkie Collins's The Woman in White: Collins's most highly esteemed piece of sensationalist mystery fiction, about the dispossession of a disenfranchised noblewoman accused of lunacy.
28: Montague Rhodes James's Ghost Stories of an Antiquary: Deals with the veiled invasion of outside forces in the rural retreat of a scholar (reminiscent of Sheridan Le Fanu or Machen).
52: Marcus Clarke's For the Term of His Natural Life: It deals with the cruelty and systematic violence inflicted on a wrongfully convicted man.
33: Algernon Blackwood's The Willows: Supernatural, threatening surroundings, intense dread and anxiety. (Strong influence on 20th-century weird fiction)
34: Richard Francis Burton's Vikram: An Indian tale of the hunt for a necromantic vampire.
Bottom Half up:
13: William Godwin's St. Leon: Godwin grapples with fallen nobility, madness, Faustian bargains, and a retreat from society in despair.
45: Mary Shelley's Lodore: Lodore covers power, responsibility, and the role of women within a family structure.
31: A. M. Barnard's The Abbot's Ghost: The Abbot's Ghost is a tale of intrigue, ghosts, forbidden love, and miracles.
23: Jane Austen's Northanger Abbey: A famous satire on Gothic Romance, with a too-romantic girl and a frightening backdrop.
30: Ann Radcliffe's A Sicilian Romance: Its themes are fallen nobility, shameful secrets, and psychological terror in a fallen castle.
Bottom Half Upper:
20: Percy Shelley's Zastrozzi: A piece lauded for its depiction of amoral self-indulgence and individualism within a backdrop of a gothic setting.
17: Sophia Reeve's The Mysterious Wanderer: A gothic romance in the style of Radcliffe, rich with deceit, murder, and family tragedy.
47: William Beckford's Caliph Vathek: Caliph Vathek is an orientalist tale dealing with ambition, greed, pacts with the devil, and the consequences of unbridled desire.
Center Bottom Right:
25: Arthur Machen's The Great God Pan: The Great God Pan deals with a metaphysical and moral transgression meddling with higher forces that create a corrupted, daemonic femme fatale.
6: Charles Brockden Brown's Wieland: Wieland deals with madness, religious fanaticism, psychological turmoil, and supernatural gruesome violence (early American transposition of the genre after Godwin).
74: Matthew Lewis' The Monk: The Monk details the corruption of a monk by a demon in female disguise as well as romance, temptation, innocence, sexuality, and strong horror elements.
18: Mrs. Carver's The Horrors Of Oakendale Abbey: haunted abbey, sensationalist, necromancy, body snatchers, death and decay, grotesque; very popular in its day.
Dead Center:
35: Eliza Fenwick's Secrecy: Female Friendship, hidden pregnancies and betrayal, societal constraints, and punishments.
2: Frances Burney's Camilla: Camilla is a generational tale of coming of age, romance, comedy, and morals with comedic and gothic episodes - immensely popular in its time, endorsed by Jane Austen.
9: Sophia Lee's The Recess, Or A Tale Of Other Times: political intrigue, heroic female lead, seafaring, warfare, gothic castles, marriage — very popular.
10: Charlotte Smith's Emmeline, Or The Orphan Of The Castle: A Cinderella story of female emancipation, gaining property and standing, where ownership over masonry and bodily autonomy blend. Gothic in style - highly popular and financially successful.
27: the anonymous Count Roderic's Castle: Love, birthrights, barbarous, violent rulers, and ominous castles with vengeful specters and dungeons.
37: Ann Radcliffe's The Castles Of Athlin And Dunbayne: The Castles of Athlin and Dunbayne deals with murder, revenge, royalty, sublime nature, excited feelings, gloomy castles, romance, and heroic women.
Center Left:
12 + 8: Charles Brockden Brown's Arthur Mervyn and Edgar Huntly: Arthur Mervyn details yellow fever plagues, crime, theft, plantation work, prostitution, and a morally gray protagonist. Edgar Huntly deals with wilderness anxiety, the supernatural, darkness and fear, sleepwalking, and veiled truths. Both are considered influential early American Gothic texts.
4: Ann Radcliffe's The Romance Of The Forest: Adds a discussion of moral values to a tale of horror and suspense, natural sublime and beauty, sexuality. (Regarded by contemporaries as her best)
5: Horace Walpole's The Castle of Otranto: Medieval haunted castle, supernatural horrific happenings, violence, surprising humor, sexuality, and a grappling with questions of identity (retrospectively one of the earliest and most formative examples of the genre)
15: Thomas Leland's Longsword, Earl Of Salisbury: Gloomy settings, evil clergymen, historical basis, abductions, royalty, shipwrecks. (an early cornerstone of the genre)
7: Eleanor Sleath's The Orphan Of The Rhine: picturesque, strongly Radcliffesque, scoundrels and half-ruined castles, strong emotions, romance, terror (praised by Jane Austen in Northanger Abbey)
21: Ann Radcliffe's The Mysteries Of Udolpho: Forced marriage, foreboding castles, fear, and scheming relatives, heroic idealistic women.
Center Top Right:
1: Regina Maria Roche's The Children Of The Abbey: The Children Of The Abbey was one of the best-selling novels of the 19th century and deals with wicked relatives, a quest for inheritance, love, royalty, castles, and heightened emotions, full of languishing.
3: William Godwin's Caleb Williams; Or, Things as They Are: A tale of treachery, persecution, political oppression, corrupt hierarchies, psychological obsession, and a big influence on Frankenstein. Individualism and the constraints of institutions.
11: Tobias Smollett's The Adventures Of Ferdinand Count Fathom: Chronicles the travels of a villainous, deceitful dandy with a supernatural undercurrent.
26: Eliza Parsons' The Castle Of Wolfenbach: The Castle of Wolfenbach is an important early piece predating many Radcliffe texts, lauded by Jane Austen as essential. (early outlier) A gothic royal romance with abundant frenzied expressions of emotion, fainting, weeping, and struggles of identity formation.
14: James Hogg's The Private Memoirs And Confessions Of A Justified Sinner: Almost unnoticed until the 20th century, carrying deep religious criticism in its anti-hero.
16: Clara Reeve's The Old English Baron: An homage to Walpole's Castle of Otranto, but intended as a more realistic and streamlined Gothic template, widely adopted. Horror, mystery, ghost stories in castles.
Adding these groupings to the dataframe and visualizing the unique attributes of the groupings:¶
textual_grouping = {
    'Upper Outliers': [22, 69, 24, 44, 39, 40, 41],
    'Upper Half': [38, 29, 32, 77],
    'Upper Half Lower': [42, 36, 53, 58],
    'Lower Outliers': [19, 28, 52, 33, 34],
    'Bottom Half Up': [13, 45, 31, 23, 30],
    'Bottom Half Upper': [20, 17, 47],
    # 35 and 2 are listed under 'Dead Center' in the descriptions above, so they are not repeated here
    'Center Bottom Right': [25, 6, 74, 18],
    'Dead Center': [35, 2, 9, 10, 27, 37],
    'Center Left': [12, 8, 4, 5, 15, 7, 21],
    'Center Top Right': [1, 3, 11, 26, 14, 16],
}
# Convert sorted_legend_items to a dictionary for easier lookup
legend_items_dict = {item[0]: item[1] for item in sorted_legend_items}
# Use the texts_dict provided to reverse map from number to category
num_to_category = {}
for category, numbers in textual_grouping.items():
    for number in numbers:
        num_to_category[number] = category
# Function to find category based on text_key
def find_category(text_key):
    # Lookup number using text_key from legend_items_dict
    number = legend_items_dict.get(text_key)
    if number is not None:
        # Lookup category using number from num_to_category
        return num_to_category.get(number)
    return None
# Apply the function to create a new column 'network_cluster'
df_net['network_cluster'] = df_net['text_key'].apply(find_category)
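A reverse map built in a loop like the one above silently keeps only the last group in which a number appears. The following is a minimal sketch of a sanity check that could be run on the grouping first; `find_duplicate_assignments` and the toy grouping are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def find_duplicate_assignments(grouping):
    """Return {number: [groups]} for every number listed in more than one group."""
    seen = defaultdict(list)
    for group, numbers in grouping.items():
        for number in numbers:
            seen[number].append(group)
    return {number: groups for number, groups in seen.items() if len(groups) > 1}


# Toy grouping for illustration only (not the notebook's data)
toy_grouping = {
    'Group A': [1, 2, 3],
    'Group B': [3, 4],
    'Group C': [5],
}

duplicates = find_duplicate_assignments(toy_grouping)
print(duplicates)  # {3: ['Group A', 'Group B']}
```

An empty result means every text number maps to exactly one cluster, so the order of the groups in the dictionary cannot change the resulting column.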
Taking a break here, so that the network visualization and all its individual steps do not need to be rerun for the purpose of the following comparison.
# df_net.to_csv('./analysis/df_net.csv', index=False)
df_net = pd.read_csv('./analysis/df_net.csv')
# First, we define the topic columns again and calculate the means
topic_columns = [col for col in df_net.columns if col.startswith('Topic ')]
# Group by network_cluster and calculate the mean for each topic within each cluster
cluster_topic_mean = df_net.groupby('network_cluster')[topic_columns].mean()
# For each cluster, identify the top 15 topics based on their mean scores
top_topics_per_cluster = cluster_topic_mean.apply(lambda x: x.nlargest(15).index.tolist(), axis=1)
# Prepare the data for visualization
# Flatten cluster_topic_mean to have a row per topic per cluster, reset index to make cluster a column
cluster_topic_means_flat = cluster_topic_mean.stack().reset_index()
cluster_topic_means_flat.columns = ['network_cluster', 'Topic', 'Mean Score']
# Sort each cluster's topics by 'Mean Score' descending before plotting
cluster_topic_means_flat.sort_values(by=['network_cluster', 'Mean Score'], ascending=[True, False], inplace=True)
# Filter rows where the Topic is one of the top topics for its cluster
cluster_topic_means_flat = cluster_topic_means_flat[cluster_topic_means_flat.apply(lambda row: row['Topic'] in top_topics_per_cluster[row['network_cluster']], axis=1)]
cluster_topic_means_flat['Topic Label'] = cluster_topic_means_flat['Topic'].apply(lambda x: f'{x}: {topic_labels.get(x, "Label not found")}')
# Now we plot each cluster's data in separate subplots with their own y-axis entries.
fig, axes = plt.subplots(nrows=len(top_topics_per_cluster), ncols=1, figsize=(10, 5 * len(top_topics_per_cluster)))
# Add a single heading before the first graph
fig.suptitle('Top 15 Topics per Cluster', fontsize=16, y=1.0025)
# With a single cluster, plt.subplots returns a bare Axes object, so normalize to a flat array
axes = np.atleast_1d(axes).ravel()
# Plotting each cluster's data
for ax, (cluster, topics) in zip(axes, top_topics_per_cluster.items()):
    # Filter the dataframe for the cluster and its top 15 topics
    df_cluster = cluster_topic_means_flat[(cluster_topic_means_flat['network_cluster'] == cluster) & (cluster_topic_means_flat['Topic'].isin(topics))]
    sns.barplot(data=df_cluster, x='Mean Score', y='Topic Label', ax=ax, palette='deep', hue='Topic Label', legend=False)
    ax.set_title(cluster)
    # The plotted values are cluster means, so the axis is labeled accordingly
    ax.set_xlabel('Mean Score')
    ax.set_ylabel('')
plt.tight_layout()
plt.show()
Upper Outliers: Topic focus: Emotional Turmoil, Invasion of the sacred, societal discontent
Upper Half: Topic focus: Clamor, Grief, social and emotional upheaval, Identity
Upper Half Lower: Topic focus: Entirely different make-up, human desires, yearning and health, atmospheric settings, fantastical elements
Lower Outliers (similar to Upper Outliers): Topic focus: Very high Societal Disenfranchisement, Punishment
Bottom Half Up: Topic focus: Domestic and Social Conflicts, Nobility, Conflict, Intimacy
Bottom Half Upper: Topic focus: Battle, Death, Aggression, Conflict, Resistance against demands and Mystery
Center Bottom Right: Topic focus: Invasion of the sacred, Religion, Death and Ferocity
Dead Center: Topic focus: Very High Occurrence of Emotional and Interpersonal Conflicts, Gothic Settings, Quests for identity
Center Left: Topic focus: Social and Emotional Distress, more interpersonal intimacy and interactions, Gothic settings, supernatural impressions and violence
Center Top Right: Topic focus: Royalty, Institutions, Tragedy, Obsessions, Individualism and Revolt
Evaluation Variant¶
To evaluate the plausibility of these findings, additional features available in the corpus have been added to the graph. Of the 60% of the texts that stem from the Colors Corpus, almost all provide information on the role of the text, while two thirds offer information on the period. To avoid distorting the initial interpretation, which is based on features inherent to the texts themselves, the graph was duplicated and this version is used solely for retroactive evaluation.
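The duplication step can be sketched as follows. This is a minimal illustration on toy data, assuming a directed graph keyed by `text_key` and a dataframe with `role` and `period` columns; `G_toy`, `df_toy` and `G_eval` are hypothetical names, not objects from the notebook.

```python
import networkx as nx
import pandas as pd

# Toy stand-ins for the notebook's graph and dataframe (hypothetical data)
G_toy = nx.DiGraph()
G_toy.add_edge('Text_A', 'Text_B', weight=0.8)
G_toy.add_edge('Text_B', 'Text_C', weight=0.6)

df_toy = pd.DataFrame({
    'text_key': ['Text_A', 'Text_B', 'Text_C'],
    'role': ['Central', None, 'Peripheral'],
    'period': ['Romantic', 'Victorian', None],
})

# Work on a copy so the graph used for interpretation stays untouched
G_eval = G_toy.copy()

# Attach role and period as node attributes, with 'Undefined' for missing values
meta = df_toy.set_index('text_key')
for column in ['role', 'period']:
    values = meta[column].fillna('Undefined').to_dict()
    nx.set_node_attributes(G_eval, values, name=column)

print(G_eval.nodes['Text_B'])  # {'role': 'Undefined', 'period': 'Victorian'}
```

Keeping the metadata on the copy only means that layout and centrality computed on the original graph cannot be influenced by the evaluation features.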
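# Calculate the cumulative weight for edges where multiple connections exist
processed_pairs = set()
for u, v, data in G.edges(data=True):
    # Since it's a directed graph, we need to check both directions.
    # Each reciprocal pair is summed only once; otherwise the second visit would double the total.
    if u != v and G.has_edge(v, u) and (v, u) not in processed_pairs:
        total_weight = data['weight'] + G[v][u]['weight']
        G[u][v]['weight'] = total_weight
        G[v][u]['weight'] = total_weight
        processed_pairs.add((u, v))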
# Storing these measures as node attributes for later use
for node, centrality in degree_centrality.items():
    G.nodes[node]['degree_centrality'] = centrality
partition = community_louvain.best_partition(G.to_undirected())
for node, comm_id in partition.items():
    G.nodes[node]['community'] = comm_id
# Define a color/shape mapping for roles
color_map = {
    'Influence': 'mediumseagreen',
    'Central': 'firebrick',
    'Peripheral': 'gold',
    'Undefined': 'royalblue'  # Default color for nodes without a role
}
shape_map = {
    'Pre-Romantic': 's',  # square
    'Romantic': '^',      # triangle
    'Victorian': 'D',     # diamond
    'Edwardian': 'p',     # pentagon
    'Undefined': 'o'      # circle for undefined periods
}
node_colors = []
node_shapes = []
for node in G.nodes():
    # Look up the node's metadata row once and fall back to 'Undefined' for missing values
    row = df_net[df_net['text_key'] == node].iloc[0]
    role = row['role'] if 'role' in df_net.columns and not pd.isnull(row['role']) else 'Undefined'
    node_colors.append(color_map[role])
    period = row['period'] if 'period' in df_net.columns and not pd.isnull(row['period']) else 'Undefined'
    node_shapes.append(shape_map[period])
centrality = nx.degree_centrality(G)
# Sort nodes by centrality (more central nodes get lower numbers)
sorted_nodes = sorted(G.nodes, key=lambda node: centrality[node], reverse=True)
# Assign numbers to nodes based on sorted order
numbered_labels = {node: i+1 for i, node in enumerate(sorted_nodes)}
# Use the spring layout for visualization
pos = nx.spring_layout(G, k=0.25, iterations=20, seed=42)
# Visualization
plt.figure(figsize=(20, 20))
nx.draw_networkx_edges(G, pos, alpha=0.2)
for i, node in enumerate(G.nodes()):
    nx.draw_networkx_nodes(G, pos, nodelist=[node], node_size=5000 * degree_centrality[node], node_color=node_colors[i], node_shape=node_shapes[i])
# Create a two-column legend
sorted_legend_items = sorted(numbered_labels.items(), key=lambda item: item[1])
items_per_column = len(sorted_legend_items) // 2
# Initialize empty strings for each column of the legend
left_column_text = ""
right_column_text = ""
# Populate the column strings with year of publication
for index, (node, number) in enumerate(sorted_legend_items):
    # Extract the date for this text_key
    date = df_net[df_net['text_key'] == node]['date'].iloc[0]
    entry = f"{number}: {node} ({date})\n"
    if index < items_per_column:
        left_column_text += entry
    else:
        right_column_text += entry
plt.subplots_adjust(left=0.2, right=0.8)
# Place the column strings on the plot
plt.figtext(0.02, 0.5, left_column_text, ha="left", fontsize=8, bbox={"facecolor":"orange", "alpha":0.5, "pad":5}, va='center')
plt.figtext(0.98, 0.5, right_column_text, ha="right", fontsize=8, bbox={"facecolor":"orange", "alpha":0.5, "pad":5}, va='center')
# Adjust label positions to avoid overlap with nodes
labels_pos = {node: (pos[node][0], pos[node][1] + 0.04) for node in G.nodes()} # Shift labels slightly above nodes
# Assuming 'ax' is your axis object. If you don't have it, get the current active one
ax = plt.gca()
# Legends for Roles and Periods using Line2D
role_legend_handles = [mpatches.Patch(color=color, label=label) for label, color in color_map.items()]
period_legend_handles = [mlines.Line2D([], [], color='black', marker=shape, linestyle='None', markersize=10, label=label) for label, shape in shape_map.items()]
# Role legend
role_legend = plt.legend(handles=role_legend_handles, title="Text Roles", loc='lower left', bbox_to_anchor=(0, -0.1), fancybox=True, shadow=True, ncol=4)
plt.gca().add_artist(role_legend)
# Period legend
plt.legend(handles=period_legend_handles, title="Text Periods", loc='lower right', bbox_to_anchor=(1, -0.1), fancybox=True, shadow=True, ncol=5)
# Draw labels and use adjust_text to improve their placement
texts = []
for node, label_pos in labels_pos.items():
    text = plt.text(label_pos[0], label_pos[1], str(numbered_labels[node]), ha='center', va='center', fontsize=8)
    texts.append(text)
adjust_text(texts, autoalign='y', avoid_points=False, precision=0.01)
plt.title('Cosine Similarity with their predecessors - Influence on Gothic Fiction Texts', fontsize=24)
plt.axis('off')
plt.show()
textual_grouping = {
    'Upper Outliers': [22, 69, 24, 44, 39, 40, 41],
    'Upper Half': [38, 29, 32, 77],
    'Upper Half Lower': [42, 36, 53, 58],
    'Lower Outliers': [19, 28, 52, 33, 34],
    'Bottom Half Up': [13, 45, 31, 23, 30],
    'Bottom Half Upper': [20, 17, 47],
    # 35 and 2 are discussed under 'Dead Center', so they are not repeated here
    'Center Bottom Right': [25, 6, 74, 18],
    'Dead Center': [35, 2, 9, 10, 27, 37],
    'Center Left': [12, 8, 4, 5, 15, 7, 21],
    'Center Top Right': [1, 3, 11, 26, 14, 16],
}
Unfortunately, the complexity of the network and the number of elements displayed seem to have prevented four nodes from being rendered in the evaluation version, although they are present in the interpretation version. Despite multiple attempts, this could not be repaired. Given that all of their attributes are known and the network left gaps where those nodes should have been, they will be inserted manually in the publication version. The nodes in question are 30, 35, 15, and 11.
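When individual nodes go missing from a drawing, one thing that can be checked before plotting is whether every node has a finite layout position and a positive marker size. The sketch below does only that and does not claim to identify the cause in this notebook; `audit_nodes` and the toy graph are hypothetical.

```python
import math
import networkx as nx


def audit_nodes(G, pos, sizes):
    """Return {node: [issues]} for nodes that a draw call would likely not show."""
    problems = {}
    for node in G.nodes():
        issues = []
        xy = pos.get(node)
        if xy is None:
            issues.append('no position')
        elif not all(math.isfinite(float(c)) for c in xy):
            issues.append('non-finite position')
        size = sizes.get(node, 0)
        if not math.isfinite(size) or size <= 0:
            issues.append('non-positive size')
        if issues:
            problems[node] = issues
    return problems


# Toy graph with one isolated node (illustration only)
G_toy = nx.DiGraph([('a', 'b'), ('b', 'c')])
G_toy.add_node('d')
pos_toy = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (0.0, 1.0), 'd': (1.0, 1.0)}

# Same sizing rule as in the plot above: 5000 times the degree centrality
sizes_toy = {n: 5000 * c for n, c in nx.degree_centrality(G_toy).items()}

print(audit_nodes(G_toy, pos_toy, sizes_toy))  # {'d': ['non-positive size']}
```

If the audit comes back empty for the affected nodes, the remaining candidates are the per-node marker and color arguments, which can be inspected in the same way.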
Contextual organization of the motifs on each cluster:
Upper Outliers: -> Psychological Terror, Politics, Fanaticism, Transgressive Indulgence and Moral Debasement (Topic focus: Emotional Turmoil, Invasion of the sacred, societal discontent)
Eval: 39, 69, 24 (Poe, M. Shelley, Lewis): high-impact texts, stemming from the center points of either the first or second wave of high Gothic productivity.
Upper Half: -> Societal Disarray and Alienation, Humorous Subversion of Expectations (Topic focus: Clamor, Grief, social and emotional upheaval, Identity)
Eval: Three quite early Romantic texts, one of which was a popular favorite of its time, plus one early text by M. Shelley. All of them are regarded as either influential or important to the genre. Of particular interest is the nearly central The Heroine by Eaton Barrett, a text praised for picking up and popularizing the style and content of Radcliffean Gothic fiction, but in a satirical, quixotic, meta-referential and critical fashion. It was highly praised by other authors of the genre and wildly successful with a wider audience.
Upper Half Lower: -> Life, Death, Pursuit of Happiness (Topic focus: Entirely different make-up, human desires, yearning and health, atmospheric settings, fantastical elements)
Eval: All of them are by Hawthorne, whose role within the genre was seemingly not recognized by the creators of either of the corpora used here. Given that none of the other graphs allotted him any space, this can be seconded here, even if his texts fall quite squarely within the second wave of increased production within the corpus.
Lower Outliers (similar to Upper Outliers): -> Systemic violence, Supernatural Othering Encounters (Topic focus: Very high Societal Disenfranchisement, Punishment)
Eval: Of these texts, only Blackwood's The Willows (33) and Clarke's For the Term of his Natural Life (52) are recognized for their influence on the genre. The former in particular is often regarded as one of the finest pieces of supernatural fiction and an important predecessor of the Weird Fiction surge of the 20th century. The text seems well connected with the lower sections of the center, in particular Arthur Machen and Godwin's St. Leon, both of which share an ominous, foreboding and anxious attitude.
Bottom Half Up: -> Fallen Nobility, Emotional Distress and Struggles of Love, Satire (Topic focus: Domestic and Social Conflicts, Nobility, Conflict, Intimacy)
Eval: 13 - Godwin (St. Leon - Central; Faustian bargains, fallen nobility, madness). 45 - M. Shelley (Lodore - Peripheral; a feminist, egalitarian novel). 23 - Jane Austen's Northanger Abbey is famed for its delicate parody of the genre while remaining a prime example itself; it falls in between the first and second spike. 30 - Radcliffe (A Sicilian Romance - Central, Romantic, first spike), very classic in its themes.
Bottom Half Upper: -> Dark Desires and Dire Consequences, Pacts (Topic focus: Battle, Death, Aggression, Conflict, Resistance against demands and Mystery)
Eval: All three are Romantic and very early texts. Shelley and Beckford are regarded as Central, Reeve as Peripheral. Percy Shelley's Zastrozzi is a piece of cruel self-indulgence, revenge and amorality, influential for its iconic villain. Beckford's Caliph Vathek is an early Orientalist adaptation of Walpole's seminal text, highly successful and esteemed for its amorality, devil's pacts and powerful use of architecture. It managed to combine the elements of earlier Gothic texts with influences from the Arabian Nights, which had a strong influence on the English Romantic movement. The Mysterious Wanderer was a popular book in its time and reached a large audience, but has since receded in recognition and been largely forgotten.
Center Bottom Right: -> Very grotesque, physical depictions in religiously coded settings, preceded by moral transgressions (Body Horror) (less female representation) (Topic focus: Invasion of the sacred, Religion, Death and Ferocity)
Eval: 25 - The Great God Pan: Central, Romantic, first peak, early, featuring pagan rites, Christian symbolism, and a scientific, amoral crossing of boundaries. Influential for Charles Brockden, H P Lovecraft, Oscar Wilde and Bram Stoker, drawing heavily from Poe, Le Fanu and Stevenson. Heavy influence on 20th-century weird fiction. Brown's Wieland (Central) is a gruesome early American piece, the first of its kind, also heavy on Christian symbolism, madness and supernatural elements. 74 - Lewis' The Monk is surprisingly unconnected and peripheral for its (Central) position, because of its unique topical composition: 70, 63, 51 (highest), 38, 34, 30, 9, 4, 5. It still shares 70, 5 and 38, which puts it in line with the essential texts, but the rest of its make-up falls out of line, heavily emphasizing persecution, treacherous company, murder and temptation by devils. This is very fitting and to the point, but potentially sets it apart due to the corruption of its protagonists and its mixture of transgression and clerical elements. 18 - The Horrors Of Oakendale Abbey: Romantic, only Peripheral, very gruesome and grotesque, early, highly popular and influential, but not regarded as highly in itself because of its lowly themes.
Dead Center: -> Female Protagonists, struggling for control, their place in a changing environment or love (Gothic Romance) (strongest female representation) (Topic focus: Very High Occurrence of Emotional and Interpersonal Conflicts, Gothic Settings, Quests for identity)
Eval: 35 - Fenwick's Secrecy (Peripheral, Romantic): an early influence on Radcliffe's The Italian; morbid, feminist, transgression, female independence. 2 - Burney's Camilla: immensely popular, a generational coming-of-age tale of several women, comedic with Gothic episodes, praised by Jane Austen. 9 - Sophia Lee's The Recess, Or A Tale Of Other Times: political intrigue, heroic female lead, seafaring, warfare, Gothic castles, marriage; very popular, very early, Romantic and Central. 10 - Charlotte Smith's Emmeline, Or The Orphan Of The Castle (Romantic, Central, early): a Cinderella story of female emancipation, gaining property and standing, where ownership over masonry and bodily autonomy blend; Gothic in style, highly popular and financially successful. 37 - a Radcliffe text sits in between. A number of very popular texts featuring heroic women, romance and terror, some of which are largely overlooked today, hold sway here.
Center Left: -> Core foundational early British texts and their first adaptations overseas. The latter grapple with disease and the wilderness, while the former establish core themes of heightened emotions, insanity, the sublime, ghosts and castles (Topic focus: Social and Emotional Distress, more interpersonal intimacy and interactions, Gothic settings, supernatural impressions and violence).
Eval: All are regarded as central early texts of the genre, from both sides of the Atlantic. Brown draws more heavily from Godwin than the others, creating a distinctive voice of wilderness, social disarray and abandonment. His texts Arthur Mervyn and Edgar Huntly defined the American branch of the genre as early, highly influential adaptations. While Wieland (Center Bottom Right, more carnal) and Edgar Huntly are more philosophical and more closely tied to the tradition, Arthur Mervyn is further out: sprawling, less successful, less popular in its time, and darker (yellow fever). The British texts include two by Radcliffe as well as Eleanor Sleath's The Orphan Of The Rhine: picturesque, strongly Radcliffesque, with scoundrels and half-ruined castles, strong emotions, romance and terror (praised by Jane Austen in Northanger Abbey), regarded as very similar in style and hugely successful. They also include Horace Walpole's The Castle of Otranto: a medieval haunted castle, supernatural horrific happenings, violence, surprising humor, sexuality, and a grappling with questions of identity (retrospectively one of the earliest and most formative examples of the genre).
The American branch has a distinctly more individualist, more Godwinian and less medieval character, while many of the British texts deal with love, sexuality and royalty and contain a good deal of historical fiction. All of them are very early, almost all of them Romantic, and all Central.
Center Top Right: -> These texts deal with unrest within power structures, whether through family intrigue, religious power struggles or a quest of individuals against corrupt institutions (Topic focus: Royalty, Institutions, Tragedy, Obsessions, Individualism and Revolt)
Eval: 11 - Tobias Smollett's The Adventures Of Ferdinand Count Fathom is the only Pre-Romantic text here and is not Central but only Influential; it resembles Zastrozzi in many ways. 26 - Eliza Parsons' The Castle Of Wolfenbach is an important early piece predating many Radcliffe texts, lauded by Jane Austen as essential (early outlier): a Gothic royal romance with abundant frenzied expressions of emotion, fainting, weeping, and struggles of identity formation. 16 - Clara Reeve's The Old English Baron: an homage to Walpole's Castle of Otranto, but intended as a more realistic and streamlined Gothic template that was widely adopted; horror, mystery, ghost stories in castles. 1 - Regina Maria Roche's The Children Of The Abbey was one of the best-selling novels of the 19th century and deals with wicked relatives, a quest for inheritance, love, royalty, castles, and heightened emotions, full of languishing. These three very popular texts continue the trend from Center Left and Dead Center: highly popular, either Gothic Romances in the style of Radcliffe or preceding it, or sharing in Walpole's gruesome, sexual and supernatural style, with Nr. 1, The Children Of The Abbey, representing a best seller of its time that further embraces these Radcliffesque motifs. Outliers from this picture, but underlining the more philosophical side of the genre, are 3 - William Godwin's Caleb Williams; Or, Things as They Are, a tale of treachery, persecution, political oppression, corrupt hierarchies, psychological obsession, individualism and the constraints of institutions, and a big influence on Frankenstein, and 14 - James Hogg's The Private Memoirs And Confessions Of A Justified Sinner, which deals with a castle haunted by a young woman wrongfully accused of murder. Both deal with the corruption and moral degradation of institutions within gruesome settings. While the former actively attempts to rally for a cause from the perspective of the downtrodden, the latter went almost unnoticed until the 20th century, carrying deep religious criticism in its anti-hero.
Assessment: Generally, relevance within the network is very strongly tied to Romantic authors; almost all the texts central here are also regarded as Central, with some Peripheral ones included. There is a strong female presence and an intermixing of popular and traditional Gothic authors, with some texts of more philosophical inquiry mixed in. Many popular texts by female authors, written to make a living, expanded the genre creatively and shaped the picture for texts to come. (Research the financial success of the main contributors.) The trajectory still warrants further analysis. Inter-cluster evaluation of the station and positioning of authors still needs to be investigated. The core topics have a share of more than 50% female authors. (Within the female Gothic and the gruesome pulp variant, there is more space for popular fiction to hold sway and influence the style more strongly.)
While these texts are regarded as inciting and influential for the genre, those most distinct for the topics at hand are sometimes others. See the analysis above.
'''
Generally speaking, the topics can be categorized in a set of main groups:
-Emotional turmoil and psychological distress
-Physical violence and combat
-Social settings, diplomacy, and court
-Self-expression and frustration with society
-Myth, lore and tales
-Forbidden truths and knowledge
-Adventure and exploration
-Ambition, greed, and regality
-Deceit and apprehension
-Science and reasoning
-Nature - woods, mountains and harbors
-Religion and sacred rituals
-Monsters, demons and undead
-Medieval settings, cities, and castles
-Dreams and illusions
'''
# Calculate centrality measures
degree_centrality = nx.degree_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)
top_10_degree = sorted(degree_centrality, key=degree_centrality.get, reverse=True)[:10]
top_10_betweenness = sorted(betweenness_centrality, key=betweenness_centrality.get, reverse=True)[:10]
top_10_closeness = sorted(closeness_centrality, key=closeness_centrality.get, reverse=True)[:10]
top_10_eigenvector = sorted(eigenvector_centrality, key=eigenvector_centrality.get, reverse=True)[:10]
top_10_metrics = {
    "Degree Centrality": top_10_degree,
    "Betweenness Centrality": top_10_betweenness,
    "Closeness Centrality": top_10_closeness,
    "Eigenvector Centrality": top_10_eigenvector
}
top_10_metrics
{'Degree Centrality': ['Roche_TheChildre',
'Burney_CamillaOrA',
'Godwin_CalebWilli',
'Radcliffe_TheRomance',
'Walpole_TheCastleO',
'Brown_WielandOrT',
'Sleath_TheOrphanO',
'Brown_EdgarHuntl',
'Lee_TheRecessO',
'Smith_EmmelineOr'],
'Betweenness Centrality': ['Burney_CamillaOrA',
'Roche_TheChildre',
'Hogg_ThePrivate',
'Godwin_CalebWilli',
'Brown_EdgarHuntl',
'Collins_TheWomanin',
'Radcliffe_TheRomance',
'Parsons_TheCastleO',
'Sleath_TheOrphanO',
'Brown_ArthurMerv'],
'Closeness Centrality': ['Lytton_Falkland',
'James_GhostStori',
'Hogg_ThePrivate',
'Blackwood_TheWillows',
'Machen_TheGreatGo',
'Burton_Vikramandt',
'Collins_TheWomanin',
'Brown_ArthurMerv',
'James_AThinGhost',
'Brown_EdgarHuntl'],
'Eigenvector Centrality': ['Hawthorne_TheWhiteOl',
'Hawthorne_LittleAnni',
'Hawthorne_TheLilysQu',
'Marsh_TheBeetleA',
'LeFanu_UncleSilas',
'Hawthorne_TwiceToldT',
'Blackwood_TheWillows',
'James_AThinGhost',
'Hodgson_TheHouseOn',
'James_GhostStori']}
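The four rankings printed above overlap only partially. One simple way to quantify this is the Jaccard overlap between each pair of top-10 lists. The sketch below is an illustration using two of the lists copied from the output above; `ranking_overlap` is a hypothetical helper and not part of the notebook.

```python
from itertools import combinations


def ranking_overlap(rankings):
    """Return the Jaccard overlap for every pair of top-N lists."""
    result = {}
    for (name_a, list_a), (name_b, list_b) in combinations(rankings.items(), 2):
        a, b = set(list_a), set(list_b)
        result[(name_a, name_b)] = len(a & b) / len(a | b)
    return result


# Two of the lists from the output above, copied verbatim
top_10 = {
    'Degree Centrality': [
        'Roche_TheChildre', 'Burney_CamillaOrA', 'Godwin_CalebWilli',
        'Radcliffe_TheRomance', 'Walpole_TheCastleO', 'Brown_WielandOrT',
        'Sleath_TheOrphanO', 'Brown_EdgarHuntl', 'Lee_TheRecessO',
        'Smith_EmmelineOr',
    ],
    'Betweenness Centrality': [
        'Burney_CamillaOrA', 'Roche_TheChildre', 'Hogg_ThePrivate',
        'Godwin_CalebWilli', 'Brown_EdgarHuntl', 'Collins_TheWomanin',
        'Radcliffe_TheRomance', 'Parsons_TheCastleO', 'Sleath_TheOrphanO',
        'Brown_ArthurMerv',
    ],
}

overlap = ranking_overlap(top_10)
shared = set(top_10['Degree Centrality']) & set(top_10['Betweenness Centrality'])
print(len(shared))  # 6
```

Passing the full `top_10_metrics` dictionary instead of the two-entry excerpt would yield one score per pair of centrality measures.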
Contextual Comparison¶
The comparisons on the level of influential topics and authorial contribution to a given topic emphasize the uniqueness of certain voices with regard to a specific theme, and thus put a lot of focus on authors with a distinctly unique voice and a wide reach, such as Poe, Le Fanu, Stoker and Hawthorne. The network analysis showed that, while their contributions were influential and formative for a certain aesthetic, they did not invite the same kind of imitation and formation of a movement as the likes of Walpole, Leland, Brown and Sleath. Walpole's voice was not represented in the topical part of the analysis, while his influence carried exceedingly far on the level of textual similarity at the start of a movement: influential and formative, yet not distinct in the same manner. The influential, exceptional voices that stand out in both analyses are Godwin, Mary Shelley and, to a lesser degree, Ann Radcliffe.
Appendix¶
In order to compare some base structures and evaluate the constancy of topic distributions across different variants of topic modeling, a few select methods from above have also been applied to the CTM and ETM models.
CTM¶
The following interactive visualization is only properly displayed in the html version or when run locally.
prepared_data = pyLDAvis.prepare(topic_term_dists_CTM, doc_topic_dists_CTM, doc_lengths, vocab, term_frequency)
pyLDAvis.display(prepared_data)
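When the inputs come from a model other than the LDA model used earlier, it can help to check their shapes and normalization before passing them to `pyLDAvis.prepare`. The following is a minimal sketch with toy values; `check_ldavis_inputs` is a hypothetical helper, and the checks reflect the assumption that both matrices hold row-wise probability distributions.

```python
import numpy as np


def check_ldavis_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                        vocab, term_frequency, tol=1e-6):
    """Return a list of human-readable problems (empty if none were found)."""
    topic_term = np.asarray(topic_term_dists, dtype=float)
    doc_topic = np.asarray(doc_topic_dists, dtype=float)
    problems = []
    if topic_term.shape[0] != doc_topic.shape[1]:
        problems.append('topic counts differ')
    if topic_term.shape[1] != len(vocab) or len(vocab) != len(term_frequency):
        problems.append('vocabulary sizes differ')
    if doc_topic.shape[0] != len(doc_lengths):
        problems.append('document counts differ')
    if not np.allclose(topic_term.sum(axis=1), 1.0, atol=tol):
        problems.append('topic-term rows do not sum to 1')
    if not np.allclose(doc_topic.sum(axis=1), 1.0, atol=tol):
        problems.append('doc-topic rows do not sum to 1')
    return problems


# Toy inputs: 2 topics, 3 terms, 4 documents (illustration only)
toy_topic_term = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
toy_doc_topic = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [1.0, 0.0]]
toy_doc_lengths = [10, 20, 15, 5]
toy_vocab = ['castle', 'ghost', 'forest']
toy_term_frequency = [30, 12, 8]

print(check_ldavis_inputs(toy_topic_term, toy_doc_topic, toy_doc_lengths,
                          toy_vocab, toy_term_frequency))  # []
```

If a model returns unnormalized weights, dividing each row by its sum before the call is one way to make the second pair of checks pass.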
The following interactive visualization is only properly displayed in the html version or when run locally.
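df_CTM = df_txt_features_CTM.copy()
# The import cell only pulls in `html` and `dcc` from dash, so `Dash` is imported here
from dash import Dash
app = Dash(__name__)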
# Function to convert year to decade for grouping
def year_to_decade(year):
    return (year // 10) * 10
# Applying the function to create a 'decade' column
df_CTM['decade'] = df_CTM['date'].apply(year_to_decade)
# Extracting topic columns
topic_columns_CTM = [col for col in df_CTM.columns if col.startswith('Topic')]
# Grouping by 'decade' and calculating the mean for topic distributions
decade_grouped_CTM = df_CTM.groupby('decade')[topic_columns_CTM].mean()
# Calculating the standard deviation for each topic to measure fluctuations
topic_fluctuations = decade_grouped_CTM.std()
# Function to filter topics based on a fluctuation percentile threshold
def filter_topics_by_percentile(threshold_percentile):
    percentile_threshold = np.percentile(topic_fluctuations, threshold_percentile)
    return topic_fluctuations[topic_fluctuations > percentile_threshold].index.tolist()
# Function to update the figure based on selected topics
def create_figure(selected_topics):
    fig = go.Figure()
    for topic in selected_topics:
        fig.add_trace(go.Scatter(x=decade_grouped_CTM.index, y=decade_grouped_CTM[topic],
                                 mode='lines', name=topic))
    fig.update_layout(legend_orientation="h", legend=dict(x=0, y=1.1, xanchor='left'))
    return fig
# Create slider
slider = dcc.Slider(
id='percentile-slider',
min=0,
max=100,
value=90,
marks={i: f'{i}%' for i in range(0, 101, 25)},
step=1
)
# Create dropdown (initially empty)
dropdown = dcc.Dropdown(
id='topic-dropdown',
options=[],
value=[],
multi=True
)
# App layout
app.layout = html.Div([
html.Div([slider]),
html.Div([dropdown]),
dcc.Graph(id='topic-graph')
])
# Callback for updating the dropdown options and selected values based on slider value
@app.callback(
[Output('topic-dropdown', 'options'),
Output('topic-dropdown', 'value')],
[Input('percentile-slider', 'value')]
)
def update_dropdown_options(percentile_value):
    filtered_topics = filter_topics_by_percentile(percentile_value)
    options = [{'label': topic, 'value': topic} for topic in filtered_topics]
    return options, [option['value'] for option in options]
# Callback for updating the graph based on selected topics and percentile
@app.callback(
Output('topic-graph', 'figure'),
[Input('topic-dropdown', 'value'),
Input('percentile-slider', 'value')]
)
def update_graph(selected_topics, percentile_value):
    return create_figure(selected_topics)
# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
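The filtering behind the slider can be checked without starting the Dash server. The following is a minimal sketch on made-up data: the toy frame and the helper name `topics_above_percentile` are assumptions for illustration and stand in for `df_txt_features_CTM` and `filter_topics_by_percentile`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_txt_features_CTM; all values are invented for illustration.
toy = pd.DataFrame({
    "date": [1794, 1796, 1818, 1819, 1847, 1897],
    "Topic 0": [0.60, 0.50, 0.30, 0.30, 0.10, 0.10],
    "Topic 1": [0.20, 0.20, 0.20, 0.20, 0.20, 0.20],
    "Topic 2": [0.10, 0.10, 0.15, 0.15, 0.20, 0.20],
    "Topic 3": [0.10, 0.20, 0.35, 0.35, 0.50, 0.50],
})

# Same steps as in the cell above: decade column, decade means, per-topic std.
toy["decade"] = (toy["date"] // 10) * 10
topic_cols = [c for c in toy.columns if c.startswith("Topic")]
decade_means = toy.groupby("decade")[topic_cols].mean()
fluctuations = decade_means.std()

def topics_above_percentile(fluct, percentile):
    """Return the topics whose fluctuation lies strictly above the percentile."""
    threshold = np.percentile(fluct, percentile)
    return fluct[fluct > threshold].index.tolist()

selected = topics_above_percentile(fluctuations, 50)
print(selected)
```

Because the comparison is strict, a slider value of 100 sets the threshold to the largest fluctuation and therefore selects no topic at all, which leaves the dropdown and the graph empty.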
df_CTM_clu = df_txt_features_CTM.copy()
# Selecting only the topic distribution columns for clustering
topic_columns_CTM= [col for col in df_CTM_clu.columns if col.startswith('Topic')]
topic_data = df_CTM_clu[topic_columns_CTM]
# Using PCA for dimensionality reduction
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(topic_data)
# Applying K-means clustering
kmeans = KMeans(n_clusters=5) # Choosing 5 clusters arbitrarily, can be tuned
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)
df_CTM_clu['cluster'] = labels
# Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=labels, cmap='viridis', marker='o')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=300, alpha=0.6)
plt.title('PCA-reduced Topic Data with K-means Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
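Since the number of clusters above is chosen arbitrarily, one possible way to tune it is to compare silhouette scores over a range of candidate values. The sketch below is an assumption and not part of the original analysis: it uses a synthetic document-topic matrix with three built-in groups instead of the CTM output, and fixes `random_state` and `n_init` only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic document-topic matrix (invented data): three groups of 40 documents,
# each group dominated by a different one of six topics.
alphas = np.ones((3, 6))
alphas[0, 0] = alphas[1, 2] = alphas[2, 4] = 20.0
doc_topic = np.vstack([rng.dirichlet(a, size=40) for a in alphas])

# Same reduction step as in the notebook cell above.
reduced = PCA(n_components=2).fit_transform(doc_topic)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = silhouette_score(reduced, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real topic distributions the scores are usually much closer together than on this toy data, so the result is better read as a hint than as a definitive cluster count.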
ETM¶
The following interactive visualization is only properly displayed in the html version or when run locally.
prepared_data = pyLDAvis.prepare(topic_term_dists_ETM, doc_topic_dists_ETM, doc_lengths, vocab, term_frequency)
pyLDAvis.display(prepared_data)
The following interactive visualization is only properly displayed in the html version or when run locally.
df_time_ETM=df_txt_features_ETM.copy()
topic_columns_ETM= [col for col in df_time_ETM.columns if col.startswith('Topic')]
topic_data = df_time_ETM[topic_columns_ETM]
# Function to convert year to decade
def year_to_decade(year):
    return (year // 10) * 10
# Applying the function to create a 'decade' column
df_time_ETM['decade'] = df_time_ETM['date'].apply(year_to_decade)
# Grouping by 'decade' and calculating the mean for topic distributions
decade_grouped = df_time_ETM.groupby('decade')[topic_columns_ETM].mean()
plt.figure(figsize=(20, 8)) # Keeping the graph broad
for topic in topic_columns_ETM:
    plt.plot(decade_grouped.index, decade_grouped[topic], label=topic)
plt.xlabel('Decade')
plt.ylabel('Topic Distribution')
plt.title('Adjusted Topic Trends Over Decades ETM')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=10) # Spreading out the legend further with fewer rows
plt.show()
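plt.figure(figsize=(20, 8)) # Keeping the graph broad
# NOTE: fluctuating_topics is not defined in this section; it is assumed to carry over from the earlier LDA analysis
for topic in fluctuating_topics:
    plt.plot(decade_grouped.index, decade_grouped[topic], label=topic)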
plt.xlabel('Decade')
plt.ylabel('Topic Distribution')
plt.title('Adjusted Topic Trends Over Decades')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=10) # Spreading out the legend further with fewer rows
plt.show()
df_ETM_clu = df_txt_features_ETM.copy()
# Selecting only the topic distribution columns for clustering
topic_columns_ETM= [col for col in df_ETM_clu.columns if col.startswith('Topic')]
topic_data = df_ETM_clu[topic_columns_ETM]
# Using PCA for dimensionality reduction
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(topic_data)
# Applying K-means clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(reduced_data)
labels = kmeans.predict(reduced_data)
df_ETM_clu['cluster'] = labels
# Plotting the results
plt.figure(figsize=(12, 8))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=labels, cmap='viridis', marker='o')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=300, alpha=0.6)
plt.title('PCA-reduced Topic Data with K-means Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
Evaluation of the alternative models' suitability for these purposes, as well as the plausibility of the LDA distribution¶
The Contextual Topic Model (CTM) seems to be unsuitable for the Jensen-Shannon divergence based multidimensional scaling employed by pyLDAvis. Its topics nevertheless seem coherent, and their overall distribution throughout time is roughly analogous to that of the LDA model used in the analysis. This strengthens the impression that the topic patterns and their distribution are stable across models and driven both by the distribution of publications throughout the decades and by shifts in relevant themes.
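By default, pyLDAvis places topics on its map by applying principal coordinate analysis to the pairwise Jensen-Shannon divergences between topic-term distributions. The sketch below only illustrates that distance on invented distributions; it is not pyLDAvis's own code, and the helper name `jensen_shannon` and the toy matrix are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import entropy

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * (entropy(p, m) + entropy(q, m))

# Toy topic-term distributions (each row sums to 1); values are invented.
topic_term = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
])

# Symmetric matrix of pairwise divergences between topics.
js_matrix = squareform(pdist(topic_term, metric=jensen_shannon))
print(np.round(js_matrix, 3))
```

If a model's topic-term distributions are all very similar to one another, every entry of this matrix is close to zero and the scaled map has little structure to show, which is one possible reason for a layout that is hard to read.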
The principal component analysis based on its topic distribution shows a more evenly spread-out shape but a roughly analogous amount of spread, which supports the plausibility of both models.
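The amount of spread in a PCA projection can also be put into numbers rather than judged from the scatter plot alone. The following is a minimal sketch under stated assumptions: `pca_spread` is a hypothetical helper, and the two Dirichlet samples are synthetic stand-ins for a model with document-specific topics and a model whose documents all receive nearly the same topic mixture.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_spread(doc_topic, n_components=2):
    """Summarize how widely documents are spread in PCA space."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(doc_topic)
    return {
        "explained_variance_ratio": pca.explained_variance_ratio_,
        "component_std": reduced.std(axis=0),
    }

rng = np.random.default_rng(42)
# Peaked: each document is dominated by a few topics (invented data).
peaked = rng.dirichlet(np.full(10, 0.1), size=200)
# Flat: every document has almost the same topic mixture (invented data).
flat = rng.dirichlet(np.full(10, 50.0), size=200)

spread_peaked = pca_spread(peaked)
spread_flat = pca_spread(flat)
print(spread_peaked["component_std"], spread_flat["component_std"])
```

Comparing `component_std` between models gives a rough measure of spread; the explained variance ratio alone is not enough, since it is relative to each model's own total variance.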
The results of the topic modeling in embedding space (ETM), on the other hand, seem uninterpretable and fundamentally at odds with those of the other two models. While neither CTM nor ETM can be interpreted with the same multidimensional scaling technique, and both provide meaningful topics, the features of the ETM show no meaningful spread throughout time or across texts, which leaves both the historical distribution and the principal component analysis uninterpretable.